1 Introduction

Following in the footsteps of Chester Ismay’s ModernDive. I will use ggformula, however, to create the graphs where needed.

2 Exploring Data

2.1 The nycflights13 Package

flights: Information on all 336,776 flights
airlines: A table matching airline names and their two letter IATA airline codes (also known as carrier codes) for 16 airline companies
planes: Information about each of 3,322 physical aircraft used.
weather: Hourly meteorological data for each of the three NYC airports. This data frame has 26,115 rows, roughly corresponding to the 365 × 24 × 3 = 26,280 possible hourly measurements one can observe at three locations over the course of a year.
airports: Airport names, codes, and locations for 1,458 destination airports.

2.2 Exploring flights

flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
head(flights)
## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
glimpse(flights)
## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558,…
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600,…
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", …
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, …
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA"…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD"…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, …
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733,…
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, …
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, …
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 …

2.3 Exploring airports

glimpse(airports)
## Rows: 1,458
## Columns: 8
## $ faa   <chr> "04G", "06A", "06C", "06N", "09J", "0A9", "0G6", "0G7", "0P2", …
## $ name  <chr> "Lansdowne Airport", "Moton Field Municipal Airport", "Schaumbu…
## $ lat   <dbl> 41.130, 32.461, 41.989, 41.432, 31.074, 36.371, 41.467, 42.884,…
## $ lon   <dbl> -80.620, -85.680, -88.101, -74.392, -81.428, -82.173, -84.507, …
## $ alt   <dbl> 1044, 264, 801, 523, 11, 1593, 730, 492, 1000, 108, 409, 875, 1…
## $ tz    <dbl> -5, -6, -6, -5, -5, -5, -5, -5, -5, -8, -5, -6, -5, -5, -5, -5,…
## $ dst   <chr> "A", "A", "A", "A", "A", "A", "A", "A", "U", "A", "A", "U", "A"…
## $ tzone <chr> "America/New_York", "America/Chicago", "America/Chicago", "Amer…

2.4 The 5NG - Five Named Graphs

ModernDive uses mainly these graphs 1. Scatter Plots 2. Line Graphs 3. Histograms 4. Box Plots 5. Bar Plots

Some of these are appropriate for quantitative and some for categorical variables.

2.4.1 Scatter Plots

  1. dep_delay vs arr_delay

We can plot this as

flights %>% 
  filter(carrier == "AS") %>% 
  gf_point(arr_delay~dep_delay, data = .,alpha = 0.2)
## Warning: Removed 5 rows containing missing values (geom_point).

flights %>% 
  filter(carrier == "AS") %>% 
  gf_jitter(arr_delay~dep_delay, data = .,width = 30,height = 30)
## Warning: Removed 5 rows containing missing values (geom_point).

2.4.2 Line Graphs

Line graphs are commonly used for time series plots We will try to get a sense of the weather dataset.

Use a gf_line graph when the x variable has an inherent ordering.

glimpse(weather)
## Rows: 26,115
## Columns: 15
## $ origin     <chr> "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "E…
## $ year       <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013…
## $ month      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ hour       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18,…
## $ temp       <dbl> 39.02, 39.02, 39.02, 39.92, 39.02, 37.94, 39.02, 39.92, 39…
## $ dewp       <dbl> 26.06, 26.96, 28.04, 28.04, 28.04, 28.04, 28.04, 28.04, 28…
## $ humid      <dbl> 59.37, 61.63, 64.43, 62.21, 64.43, 67.21, 64.43, 62.21, 62…
## $ wind_dir   <dbl> 270, 250, 240, 250, 260, 240, 240, 250, 260, 260, 260, 330…
## $ wind_speed <dbl> 10.3570, 8.0555, 11.5078, 12.6586, 12.6586, 11.5078, 14.96…
## $ wind_gust  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 20…
## $ precip     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pressure   <dbl> 1012.0, 1012.3, 1012.5, 1012.2, 1011.9, 1012.4, 1012.2, 10…
## $ visib      <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10…
## $ time_hour  <dttm> 2013-01-01 01:00:00, 2013-01-01 02:00:00, 2013-01-01 03:0…

We will look at the temp variable at EWR airport, between Jan 1 to Jan 15.

weather %>% 
  filter(origin == "EWR", month == 1,day<=15) %>% 
  gf_line(data = ., temp ~ time_hour)

# If `time_hour` had not been available
weather %>% 
  filter(origin == "EWR", month == 1,day<=15) %>% 
  gf_line(data = ., temp ~ (hour + 24*day))

#LC3.13: Plot a time series of a variable other than temp for Newark Airport in the first 15 days of January 2013.
weather %>% 
  filter(origin == "EWR", month == 1,day<=15) %>% 
  gf_line(data = ., humid ~ time_hour)

2.4.3 Histograms

weather %>% 
  gf_histogram(~temp,bins = 40, color = "white",fill = "steelblue")
## Warning: Removed 1 rows containing non-finite values (stat_bin).

weather %>% 
  gf_histogram(~temp|month,bins = 40)
## Warning: Removed 1 rows containing non-finite values (stat_bin).

### Boxplots

weather %>% 
  gf_boxplot(temp~factor(month),data = .,outlier.color = "red")
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

Boxplots look like a bunch of Buddhist prayer wheels.

The trick with understanding boxplots is that the shape of the boxplots is based on counts of data and the thing itself is plotted on scales that show amplitudes of data. If a boxplot was plotted on a scale showing counts every boxplot would be symmetric.

2.4.4 Bar Plots

This helps to visualize the distribution of a categorical variable. (No book had really stated this. Good on you, Chester Ismay!). This is a simpler task, as we are simply counting different categories, also known as levels, of a categorical variable. Depending upon whether the categorical variable is pre-counted or not, we choose a different type of bar plot. When the categorical variable whose distribution you want to visualize is:

  • Is not pre-counted in your data frame: use geom_bar().
  • Is pre-counted in your data frame, use geom_col() with the y-position aesthetic mapped to the variable that has the counts.

2.4.4.1 One Categorical Variable

gf_bar(data = flights, ~ carrier) # uncounted

flights_counted <- as_tibble(xtabs(data = flights, ~carrier)) # pre-counted table
gf_col(n ~ carrier, data = flights_counted)

2.4.4.2 Two Categorical Variables

We can use stacked barplots or a dodged barplot to view more than on categorical variable.

gf_bar(data = flights, ~carrier, fill = ~origin) # stacked barplot

gf_bar(data = flights, ~carrier, fill = ~origin, position = "dodge") # dodged barplot

# Facetted Barplot
gf_bar(data = flights, ~carrier|origin~., fill = ~origin)

Dodged barplots are more easy to understand and to compare numbers/counts.

Barplots are the preferred way of displaying the distribution of a categorical variable, or in other words the frequency with which the different categories called levels occur. They are easy to understand and make it easy to make comparisons across levels. When trying to visualize two categorical variables, you have many options: stacked barplots, side-by-side barplots, and faceted barplots. Depending on what aspect of the joint distribution you are trying to emphasize, you will need to make a choice between these three types of barplots.

3 Data Wrangling

dplyr functions:
- filter()
- summarize()
- group_by()
- mutate()
- arrange()
- join()

3.1 filter

# Departed from JFK
# Heading to Burlington VT or to Seattle, WA
# Departed in October, November, December
flights %>% 
  filter(origin == "JFK" & (dest == "BTV" | dest == "SEA") & month >=10)
## # A tibble: 815 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    10     1      729            735        -6     1049           1040
##  2  2013    10     1      853            900        -7     1217           1157
##  3  2013    10     1      916            925        -9     1016           1033
##  4  2013    10     1     1216           1221        -5     1326           1328
##  5  2013    10     1     1452           1459        -7     1602           1622
##  6  2013    10     1     1459           1500        -1     1817           1829
##  7  2013    10     1     1754           1800        -6     2102           2103
##  8  2013    10     1     1825           1830        -5     2159           2150
##  9  2013    10     1     1925           1930        -5     2227           2250
## 10  2013    10     1     2238           2245        -7     2348           2353
## # … with 805 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# not BTV and not SEA
flights %>% 
  filter(origin == "JFK" & !(dest == "BTV" | dest == "SEA") & month >=10)
## # A tibble: 26,184 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    10     1      536            545        -9      809            855
##  2  2013    10     1      539            545        -6      917            933
##  3  2013    10     1      544            550        -6      912            932
##  4  2013    10     1      549            600       -11      653            716
##  5  2013    10     1      553            600        -7      829            856
##  6  2013    10     1      557            600        -3      851            923
##  7  2013    10     1      606            615        -9      744            811
##  8  2013    10     1      621            630        -9      822            857
##  9  2013    10     1      627            630        -3      917            922
## 10  2013    10     1      627            630        -3      809            753
## # … with 26,174 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
#or
flights %>% filter(origin == "JFK" & dest !="BTV" & dest !="SEA" & month >=10)
## # A tibble: 26,184 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    10     1      536            545        -9      809            855
##  2  2013    10     1      539            545        -6      917            933
##  3  2013    10     1      544            550        -6      912            932
##  4  2013    10     1      549            600       -11      653            716
##  5  2013    10     1      553            600        -7      829            856
##  6  2013    10     1      557            600        -3      851            923
##  7  2013    10     1      606            615        -9      744            811
##  8  2013    10     1      621            630        -9      822            857
##  9  2013    10     1      627            630        -3      917            922
## 10  2013    10     1      627            630        -3      809            753
## # … with 26,174 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# SEA, BTV, PDX, SFO, LAX
flights %>% 
  filter(origin == "JFK" & dest %in% c("BTV", "SFO", "LAX", "PDX", "SEA") & month >=10)
## # A tibble: 5,930 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    10     1      557            600        -3      851            923
##  2  2013    10     1      627            630        -3      917            922
##  3  2013    10     1      650            700       -10      929           1003
##  4  2013    10     1      657            700        -3      946           1013
##  5  2013    10     1      658            700        -2      941           1020
##  6  2013    10     1      658            700        -2      947           1025
##  7  2013    10     1      714            715        -1      949           1010
##  8  2013    10     1      729            735        -6     1049           1040
##  9  2013    10     1      739            740        -1     1028           1055
## 10  2013    10     1      739            745        -6     1032           1100
## # … with 5,920 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

filter is the first of the verbs we should use in data wrangling, in order to clean the dataset and narrow the focus to the observations of interest.

I need to work on the scoped versions of the dplyr verbs. Update: 24/04/2020: with dplyr 1.0.0, the across() function can be used in place of scoped verbs.

3.2 summarise

summary_temp<- 
  weather %>% 
  summarise(mean = mean(temp, na.rm = TRUE), sd = sd(temp, na.rm = TRUE), count = n())
summary_temp
## # A tibble: 1 x 3
##    mean    sd count
##   <dbl> <dbl> <int>
## 1  55.3  17.8 26115

3.3 group_by

weather %>% 
  group_by(month) %>%
  summarise(
    mean_temp = mean(temp, na.rm = TRUE),
    sd_temp = sd(temp, na.rm = TRUE),
    count = n()
  )
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 4
##    month mean_temp sd_temp count
##    <int>     <dbl>   <dbl> <int>
##  1     1      35.6   10.2   2226
##  2     2      34.3    6.98  2010
##  3     3      39.9    6.25  2227
##  4     4      51.7    8.79  2159
##  5     5      61.8    9.68  2232
##  6     6      72.2    7.55  2160
##  7     7      80.1    7.12  2228
##  8     8      74.5    5.19  2217
##  9     9      67.4    8.47  2159
## 10    10      60.1    8.85  2212
## 11    11      45.0   10.4   2141
## 12    12      38.4    9.98  2144
flights %>% 
  group_by(origin) %>% 
  summarise(count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
##   origin  count
##   <chr>   <int>
## 1 EWR    120835
## 2 JFK    111279
## 3 LGA    104662

3.3.1 Grouping by more than one variable

by_origin_monthly <- 
  weather %>% 
  group_by(origin, month) %>% 
  summarise(count = n())
## `summarise()` regrouping output by 'origin' (override with `.groups` argument)
by_origin_monthly
## # A tibble: 36 x 3
## # Groups:   origin [3]
##    origin month count
##    <chr>  <int> <int>
##  1 EWR        1   742
##  2 EWR        2   669
##  3 EWR        3   743
##  4 EWR        4   720
##  5 EWR        5   744
##  6 EWR        6   720
##  7 EWR        7   741
##  8 EWR        8   740
##  9 EWR        9   719
## 10 EWR       10   736
## # … with 26 more rows
gf_col(by_origin_monthly,count~month, fill = ~origin, position = "dodge")

#LC4.6: Mean and SD of temp for each day
weather %>% 
  group_by(month,day) %>% 
  summarise(count = n(), mean_temp = mean(temp, na.rm = TRUE), sd_temp= sd(temp, na.rm = TRUE))
## `summarise()` regrouping output by 'month' (override with `.groups` argument)
## # A tibble: 364 x 5
## # Groups:   month [12]
##    month   day count mean_temp sd_temp
##    <int> <int> <int>     <dbl>   <dbl>
##  1     1     1    67      37.0    4.00
##  2     1     2    72      28.7    3.45
##  3     1     3    72      30.0    2.58
##  4     1     4    72      34.9    2.45
##  5     1     5    72      37.2    4.01
##  6     1     6    71      40.1    4.40
##  7     1     7    72      40.6    3.68
##  8     1     8    72      40.1    5.77
##  9     1     9    72      43.2    5.40
## 10     1    10    72      43.8    2.95
## # … with 354 more rows
#LC4.7 `group_by(month, origin)` vs `group_by(origin,month)`

by_origin_monthly <- 
  flights %>% 
  group_by(origin, month) %>% 
  summarise(count = n())
## `summarise()` regrouping output by 'origin' (override with `.groups` argument)
by_origin_monthly
## # A tibble: 36 x 3
## # Groups:   origin [3]
##    origin month count
##    <chr>  <int> <int>
##  1 EWR        1  9893
##  2 EWR        2  9107
##  3 EWR        3 10420
##  4 EWR        4 10531
##  5 EWR        5 10592
##  6 EWR        6 10175
##  7 EWR        7 10475
##  8 EWR        8 10359
##  9 EWR        9  9550
## 10 EWR       10 10104
## # … with 26 more rows
by_month_origin <- 
  flights %>% 
  group_by(month,origin) %>% 
  summarise(count = n())
## `summarise()` regrouping output by 'month' (override with `.groups` argument)
by_month_origin
## # A tibble: 36 x 3
## # Groups:   month [12]
##    month origin count
##    <int> <chr>  <int>
##  1     1 EWR     9893
##  2     1 JFK     9161
##  3     1 LGA     7950
##  4     2 EWR     9107
##  5     2 JFK     8421
##  6     2 LGA     7423
##  7     3 EWR    10420
##  8     3 JFK     9697
##  9     3 LGA     8717
## 10     4 EWR    10531
## # … with 26 more rows
#LC4.8 n(flights) from each airport for each carrier
flights %>% 
  group_by(origin,carrier) %>% 
  summarise(count = n())
## `summarise()` regrouping output by 'origin' (override with `.groups` argument)
## # A tibble: 35 x 3
## # Groups:   origin [3]
##    origin carrier count
##    <chr>  <chr>   <int>
##  1 EWR    9E       1268
##  2 EWR    AA       3487
##  3 EWR    AS        714
##  4 EWR    B6       6557
##  5 EWR    DL       4342
##  6 EWR    EV      43939
##  7 EWR    MQ       2276
##  8 EWR    OO          6
##  9 EWR    UA      46087
## 10 EWR    US       4405
## # … with 25 more rows

3.4 mutate

As a rough rule of thumb, as long as you are not losing original information that you might need later, it’s acceptable practice to overwrite existing data frames. So when creating new variables (columns) with mutate, we may safely overwrite the original data_frame.

weather <- 
  weather %>% 
  mutate(temp_in_c = (temp-32)/1.8)
weather
## # A tibble: 26,115 x 16
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
##  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
##  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
##  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
##  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
##  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
##  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
##  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
## 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
## # … with 26,105 more rows, and 6 more variables: wind_gust <dbl>, precip <dbl>,
## #   pressure <dbl>, visib <dbl>, time_hour <dttm>, temp_in_c <dbl>
#Let’s compute average monthly temperatures in both °F and °C using the similar group_by() and summarize() code as in the previous section.

weather %>% 
  group_by(month) %>% 
  summarise(mean_F = mean(temp, na.rm = TRUE), mean_C = mean(temp_in_c, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 3
##    month mean_F mean_C
##    <int>  <dbl>  <dbl>
##  1     1   35.6   2.02
##  2     2   34.3   1.26
##  3     3   39.9   4.38
##  4     4   51.7  11.0 
##  5     5   61.8  16.6 
##  6     6   72.2  22.3 
##  7     7   80.1  26.7 
##  8     8   74.5  23.6 
##  9     9   67.4  19.7 
## 10    10   60.1  15.6 
## 11    11   45.0   7.22
## 12    12   38.4   3.58
#Let's see how many flights "gain" time.
flights <- 
  flights %>% 
  mutate(gain = dep_delay - arr_delay)
flights
## # A tibble: 336,776 x 20
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 12 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   gain <dbl>
flights %>% 
  summarise(
  min = min(gain,na.rm = TRUE),
  median = median(gain,na.rm = TRUE),
  q3 = quantile(gain, 0.75, na.rm = TRUE),
  max = max(gain,na.rm = TRUE),
    mean = mean(gain,na.rm = TRUE),
  sd = sd(gain,na.rm = TRUE)
)
## # A tibble: 1 x 6
##     min median    q3   max  mean    sd
##   <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  -196      7    17   109  5.66  18.0
gf_histogram(flights, ~gain,color = "white",bins = 20)
## Warning: Removed 9430 rows containing non-finite values (stat_bin).

3.5 arrange

Sorting of rows in a data_frame based on alphanumeric values of some variable/column. In other words, arrange() sorts in ascending order by default unless you override this default behavior by using desc().

freq_dest <- 
  flights %>% 
  group_by(dest) %>% 
  summarise(count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
freq_dest
## # A tibble: 105 x 2
##    dest  count
##    <chr> <int>
##  1 ABQ     254
##  2 ACK     265
##  3 ALB     439
##  4 ANC       8
##  5 ATL   17215
##  6 AUS    2439
##  7 AVL     275
##  8 BDL     443
##  9 BGR     375
## 10 BHM     297
## # … with 95 more rows
freq_dest <- 
  flights %>% 
  group_by(dest) %>% 
  summarise(count = n()) %>% 
  arrange(desc(count))
## `summarise()` ungrouping output (override with `.groups` argument)
freq_dest
## # A tibble: 105 x 2
##    dest  count
##    <chr> <int>
##  1 ORD   17283
##  2 ATL   17215
##  3 LAX   16174
##  4 BOS   15508
##  5 MCO   14082
##  6 CLT   14064
##  7 SFO   13331
##  8 FLL   12055
##  9 MIA   11728
## 10 DCA    9705
## # … with 95 more rows

3.6 joining data_frames

We can join datasets using various kinds of join operations. This is done using key variables to match values across data_frames.

glimpse(airlines)
## Rows: 16
## Columns: 2
## $ carrier <chr> "9E", "AA", "AS", "B6", "DL", "EV", "F9", "FL", "HA", "MQ", "…
## $ name    <chr> "Endeavor Air Inc.", "American Airlines Inc.", "Alaska Airlin…
#Getting `carrier` info into `flights`

flights_joined <- 
  flights %>% 
  inner_join(airlines, by = "carrier")
glimpse(flights)
## Rows: 336,776
## Columns: 20
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558,…
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600,…
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", …
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, …
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA"…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD"…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, …
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733,…
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, …
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, …
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 …
## $ gain           <dbl> -9, -16, -31, 17, 19, -16, -24, 11, 5, -10, 0, 1, -9, …
glimpse(flights_joined)
## Rows: 336,776
## Columns: 21
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558,…
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600,…
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", …
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, …
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA"…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD"…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, …
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733,…
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, …
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, …
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 …
## $ gain           <dbl> -9, -16, -31, 17, 19, -16, -24, 11, 5, -10, 0, 1, -9, …
## $ name           <chr> "United Air Lines Inc.", "United Air Lines Inc.", "Ame…

Say instead we are interested in the destinations of all domestic flights departing NYC in 2013 and ask yourself:

  • “What cities are these airports in?”
  • Is “ORD” Orlando?”
  • “Where is”FLL"?
glimpse(airports)
## Rows: 1,458
## Columns: 8
## $ faa   <chr> "04G", "06A", "06C", "06N", "09J", "0A9", "0G6", "0G7", "0P2", …
## $ name  <chr> "Lansdowne Airport", "Moton Field Municipal Airport", "Schaumbu…
## $ lat   <dbl> 41.130, 32.461, 41.989, 41.432, 31.074, 36.371, 41.467, 42.884,…
## $ lon   <dbl> -80.620, -85.680, -88.101, -74.392, -81.428, -82.173, -84.507, …
## $ alt   <dbl> 1044, 264, 801, 523, 11, 1593, 730, 492, 1000, 108, 409, 875, 1…
## $ tz    <dbl> -5, -6, -6, -5, -5, -5, -5, -5, -5, -8, -5, -6, -5, -5, -5, -5,…
## $ dst   <chr> "A", "A", "A", "A", "A", "A", "A", "A", "U", "A", "A", "U", "A"…
## $ tzone <chr> "America/New_York", "America/Chicago", "America/Chicago", "Amer…
flights_with_airport_names <- 
  flights %>% inner_join(airports, by = c("dest" = "faa"))
glimpse(flights_with_airport_names)
## Rows: 329,174
## Columns: 27
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time       <int> 517, 533, 542, 554, 554, 555, 557, 557, 558, 558, 558,…
## $ sched_dep_time <int> 515, 529, 540, 600, 558, 600, 600, 600, 600, 600, 600,…
## $ dep_delay      <dbl> 2, 4, 2, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1, 0…
## $ arr_time       <int> 830, 850, 923, 812, 740, 913, 709, 838, 753, 849, 853,…
## $ sched_arr_time <int> 819, 830, 850, 837, 728, 854, 723, 846, 745, 851, 856,…
## $ arr_delay      <dbl> 11, 20, 33, -25, 12, 19, -14, -8, 8, -2, -3, 7, -14, 3…
## $ carrier        <chr> "UA", "UA", "AA", "DL", "UA", "B6", "EV", "B6", "AA", …
## $ flight         <int> 1545, 1714, 1141, 461, 1696, 507, 5708, 79, 301, 49, 7…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N668DN", "N39463", "N51…
## $ origin         <chr> "EWR", "LGA", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK"…
## $ dest           <chr> "IAH", "IAH", "MIA", "ATL", "ORD", "FLL", "IAD", "MCO"…
## $ air_time       <dbl> 227, 227, 160, 116, 150, 158, 53, 140, 138, 149, 158, …
## $ distance       <dbl> 1400, 1416, 1089, 762, 719, 1065, 229, 944, 733, 1028,…
## $ hour           <dbl> 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6, …
## $ minute         <dbl> 15, 29, 40, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 …
## $ gain           <dbl> -9, -16, -31, 19, -16, -24, 11, 5, -10, 0, 1, -9, 12, …
## $ name           <chr> "George Bush Intercontinental", "George Bush Intercont…
## $ lat            <dbl> 29.984, 29.984, 25.793, 33.637, 41.979, 26.073, 38.945…
## $ lon            <dbl> -95.341, -95.341, -80.291, -84.428, -87.905, -80.153, …
## $ alt            <dbl> 97, 97, 8, 1026, 668, 9, 313, 96, 668, 19, 26, 126, 13…
## $ tz             <dbl> -6, -6, -5, -5, -6, -5, -5, -5, -6, -5, -5, -8, -8, -6…
## $ dst            <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",…
## $ tzone          <chr> "America/Chicago", "America/Chicago", "America/New_Yor…
named_dests <- 
  flights_with_airport_names %>%
  group_by(dest) %>% 
  summarise(num_flights = n()) %>% 
  arrange(desc(num_flights)) %>% 
  inner_join(airports, by = c("dest"= "faa")) %>% 
  rename(airport_name = "name")
## `summarise()` ungrouping output (override with `.groups` argument)
named_dests
## # A tibble: 101 x 9
##    dest  num_flights airport_name          lat    lon   alt    tz dst   tzone   
##    <chr>       <int> <chr>               <dbl>  <dbl> <dbl> <dbl> <chr> <chr>   
##  1 ORD         17283 Chicago Ohare Intl   42.0  -87.9   668    -6 A     America…
##  2 ATL         17215 Hartsfield Jackson…  33.6  -84.4  1026    -5 A     America…
##  3 LAX         16174 Los Angeles Intl     33.9 -118.    126    -8 A     America…
##  4 BOS         15508 General Edward Law…  42.4  -71.0    19    -5 A     America…
##  5 MCO         14082 Orlando Intl         28.4  -81.3    96    -5 A     America…
##  6 CLT         14064 Charlotte Douglas …  35.2  -80.9   748    -5 A     America…
##  7 SFO         13331 San Francisco Intl   37.6 -122.     13    -8 A     America…
##  8 FLL         12055 Fort Lauderdale Ho…  26.1  -80.2     9    -5 A     America…
##  9 MIA         11728 Miami Intl           25.8  -80.3     8    -5 A     America…
## 10 DCA          9705 Ronald Reagan Wash…  38.9  -77.0    15    -5 A     America…
## # … with 91 more rows

3.6.1 Using multiple key variables to join

flights_weather_joined <- flights %>% 
  inner_join(weather, by = c("year", "month", "day", "hour","origin"))
flights_weather_joined
## # A tibble: 335,220 x 31
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 335,210 more rows, and 23 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour.x <dttm>, gain <dbl>, temp <dbl>, dewp <dbl>, humid <dbl>,
## #   wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>, precip <dbl>,
## #   pressure <dbl>, visib <dbl>, time_hour.y <dttm>, temp_in_c <dbl>

3.6.2 LC4.20 Neat Problem

Compute the available seat miles for each airline sorted in descending order.

# compute the available seat miles for each airline sorted in descending order. 
seatmiles <- flights %>% 
  inner_join(airlines,by = "carrier") %>% 
  inner_join(planes, by = "tailnum") %>% 
# flights with NO tailnum did not operate !
# check the following:
# flights %>% filter(is.na(tailnum) & !is.na(dep_time))
  select(carrier, name, distance, seats) %>% 
  mutate(seatmiles = seats*distance) %>% 
  group_by(carrier, name) %>% 
  summarise(total_seat_miles_million = sum(seatmiles)/1000000) %>% 
  arrange(desc(total_seat_miles_million))
## `summarise()` regrouping output by 'carrier' (override with `.groups` argument)
seatmiles
## # A tibble: 16 x 3
## # Groups:   carrier [16]
##    carrier name                        total_seat_miles_million
##    <chr>   <chr>                                          <dbl>
##  1 UA      United Air Lines Inc.                       15516.  
##  2 DL      Delta Air Lines Inc.                        10533.  
##  3 B6      JetBlue Airways                              9618.  
##  4 AA      American Airlines Inc.                       3677.  
##  5 US      US Airways Inc.                              2534.  
##  6 VX      Virgin America                               2297.  
##  7 EV      ExpressJet Airlines Inc.                     1817.  
##  8 WN      Southwest Airlines Co.                       1718.  
##  9 9E      Endeavor Air Inc.                             777.  
## 10 HA      Hawaiian Airlines Inc.                        642.  
## 11 AS      Alaska Airlines Inc.                          314.  
## 12 FL      AirTran Airways Corporation                   220.  
## 13 F9      Frontier Airlines Inc.                        185.  
## 14 YV      Mesa Airlines Inc.                             20.2 
## 15 MQ      Envoy Air                                       7.16
## 16 OO      SkyWest Airlines Inc.                           1.30

4 Data Importing and tidy data

# "Levels" of democracy
dem_score <- read_csv("https://moderndive.com/data/dem_score.csv")
## Parsed with column specification:
## cols(
##   country = col_character(),
##   `1952` = col_double(),
##   `1957` = col_double(),
##   `1962` = col_double(),
##   `1967` = col_double(),
##   `1972` = col_double(),
##   `1977` = col_double(),
##   `1982` = col_double(),
##   `1987` = col_double(),
##   `1992` = col_double()
## )
dem_score
## # A tibble: 96 x 10
##    country    `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992`
##    <chr>       <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 Albania        -9     -9     -9     -9     -9     -9     -9     -9      5
##  2 Argentina      -9     -1     -1     -9     -9     -9     -8      8      7
##  3 Armenia        -9     -7     -7     -7     -7     -7     -7     -7      7
##  4 Australia      10     10     10     10     10     10     10     10     10
##  5 Austria        10     10     10     10     10     10     10     10     10
##  6 Azerbaijan     -9     -7     -7     -7     -7     -7     -7     -7      1
##  7 Belarus        -9     -7     -7     -7     -7     -7     -7     -7      7
##  8 Belgium        10     10     10     10     10     10     10     10     10
##  9 Bhutan        -10    -10    -10    -10    -10    -10    -10    -10    -10
## 10 Bolivia        -4     -3     -3     -4     -7     -7      8      9      9
## # … with 86 more rows

In this dem_score data frame, the minimum value of -10 corresponds to a highly autocratic nation whereas a value of 10 corresponds to a highly democratic nation.

4.1 Analysing drinks data from fivethirtyeight

drinks
## # A tibble: 193 x 5
##    country      beer_servings spirit_servings wine_servings total_litres_of_pur…
##    <chr>                <int>           <int>         <int>                <dbl>
##  1 Afghanistan              0               0             0                  0  
##  2 Albania                 89             132            54                  4.9
##  3 Algeria                 25               0            14                  0.7
##  4 Andorra                245             138           312                 12.4
##  5 Angola                 217              57            45                  5.9
##  6 Antigua & B…           102             128            45                  4.9
##  7 Argentina              193              25           221                  8.3
##  8 Armenia                 21             179            11                  3.8
##  9 Australia              261              72           212                 10.4
## 10 Austria                279              75           191                  9.7
## # … with 183 more rows
drinks_smaller <- drinks %>% 
  filter(country %in% c("USA","China","Italy","Saudi Arabia")) %>% 
  select(-total_litres_of_pure_alcohol) %>% 
  rename(beer = beer_servings,
         spirit = spirit_servings,
         wine = wine_servings)
drinks_smaller
## # A tibble: 4 x 4
##   country       beer spirit  wine
##   <chr>        <int>  <int> <int>
## 1 China           79    192     8
## 2 Italy           85     42   237
## 3 Saudi Arabia     0      5     0
## 4 USA            249    158    84

This code below does not work…so we need to tidy the data

# gf_bar(drinks_smaller,~beer, fill = ~country,width = 7,position = "dodge") %>% 
# gf_bar(drinks_smaller,~spirit, fill = ~country,width = 7,position = "dodge") %>% 
# gf_bar(drinks_smaller,~wine, fill = ~country,width = 7,position = "dodge")

# Can do this ordinary thing
gf_point(beer~reorder(country,beer), data = drinks_smaller,xlab = "Country") %>% gf_refine(coord_flip())

So we need to gather this data into narrow form

drinks_smaller_tidy <- 
  drinks_smaller %>% 
  pivot_longer(., cols = c(beer,spirit,wine),
    names_to = "type_of_drink",
    values_to = "servings")
drinks_smaller_tidy
## # A tibble: 12 x 3
##    country      type_of_drink servings
##    <chr>        <chr>            <int>
##  1 China        beer                79
##  2 China        spirit             192
##  3 China        wine                 8
##  4 Italy        beer                85
##  5 Italy        spirit              42
##  6 Italy        wine               237
##  7 Saudi Arabia beer                 0
##  8 Saudi Arabia spirit               5
##  9 Saudi Arabia wine                 0
## 10 USA          beer               249
## 11 USA          spirit             158
## 12 USA          wine                84
gf_col(drinks_smaller_tidy,servings ~ country, 
       fill = ~ type_of_drink,color = "black",
       position = "dodge")

So wide data is good for looking at and eyeballing; Narrow data is essential for plotting and not really good for looking at, with its repeated names in the key coloumn.

4.1.1 Problem LC5.3

airline_safety_smaller <- 
  airline_safety %>% 
  select(-c(incl_reg_subsidiaries,avail_seat_km_per_week))
airline_safety_smaller
## # A tibble: 56 x 7
##    airline incidents_85_99 fatal_accidents… fatalities_85_99 incidents_00_14
##    <chr>             <int>            <int>            <int>           <int>
##  1 Aer Li…               2                0                0               0
##  2 Aerofl…              76               14              128               6
##  3 Aeroli…               6                0                0               1
##  4 Aerome…               3                1               64               5
##  5 Air Ca…               2                0                0               2
##  6 Air Fr…              14                4               79               6
##  7 Air In…               2                1              329               4
##  8 Air Ne…               3                0                0               5
##  9 Alaska…               5                0                0               5
## 10 Alital…               7                2               50               4
## # … with 46 more rows, and 2 more variables: fatal_accidents_00_14 <int>,
## #   fatalities_00_14 <int>
# How to tidy up this dataset? Here is the old code:

# airline_safety_smaller_tidy <- 
#   airline_safety_smaller %>% 
#   gather(key = "incident_type_years", # What to call the stacked up column headers
#          value = "count", # What to call the stacked up the readings
#          ... = -airline) # Where to (not) stack from !!

# airline_safety_smaller_tidy

# New code

airline_safety_smaller_tidy <- 
  airline_safety_smaller %>% 
  pivot_longer(data = .,cols = c(-airline), 
    names_to = "incident_type_years", 
    values_to = "count")
airline_safety_smaller_tidy
## # A tibble: 336 x 3
##    airline    incident_type_years   count
##    <chr>      <chr>                 <int>
##  1 Aer Lingus incidents_85_99           2
##  2 Aer Lingus fatal_accidents_85_99     0
##  3 Aer Lingus fatalities_85_99          0
##  4 Aer Lingus incidents_00_14           0
##  5 Aer Lingus fatal_accidents_00_14     0
##  6 Aer Lingus fatalities_00_14          0
##  7 Aeroflot   incidents_85_99          76
##  8 Aeroflot   fatal_accidents_85_99    14
##  9 Aeroflot   fatalities_85_99        128
## 10 Aeroflot   incidents_00_14           6
## # … with 326 more rows
gf_col(airline_safety_smaller_tidy,
      count~airline,fill = ~incident_type_years,size = 8) %>% 
  gf_refine(coord_flip()) %>% 
  gf_theme(theme_minimal()) %>% 
  gf_theme(axis.text.y = 
             element_text(hjust = 1.0, color = "navy", size = 5),
           legend.text = element_text(size = 8),
           legend.title = element_text(size = 10),
           legend.background = element_rect(fill = "grey"))

# Can use `reorder` to plot these and get the Dustin Hoffman `Rain Man` plot. (Quantas has had no accidents)

gf_col(airline_safety_smaller_tidy,
      count ~ reorder(airline,count),fill = ~incident_type_years,size = 8, title = "Dustin Hoffman Rain Man plot", subtitle = "Quantas has had no accidents") %>% 
  gf_refine(coord_flip()) %>% 
  gf_theme(theme_minimal()) %>% 
  gf_theme(axis.text.y = 
             element_text(hjust = 1.0, color = "navy", size = 5),
           legend.text = element_text(size = 8),
           legend.title = element_text(size = 10),
           legend.background = element_rect(fill = "grey"))

5 Basic Regression

The fundamental premise of data modeling is to make explicit the relationship between:
- an outcome variable y, also called a dependent variable and
- an explanatory/predictor variable x, also called an independent variable or covariate.

Linear Regression is when y is numerical or quantitative; and x is either numerical or categorical. x is called the independent variable or also the covariate.

5.1 One numerical explanatory variable

library(moderndive)
library(skimr)
library(gapminder)

# We use `evals`, the teacher evaluation score dataset
moderndive::evals
## # A tibble: 463 x 14
##       ID prof_ID score   age bty_avg gender ethnicity language rank  pic_outfit
##    <int>   <int> <dbl> <int>   <dbl> <fct>  <fct>     <fct>    <fct> <fct>     
##  1     1       1   4.7    36    5    female minority  english  tenu… not formal
##  2     2       1   4.1    36    5    female minority  english  tenu… not formal
##  3     3       1   3.9    36    5    female minority  english  tenu… not formal
##  4     4       1   4.8    36    5    female minority  english  tenu… not formal
##  5     5       2   4.6    59    3    male   not mino… english  tenu… not formal
##  6     6       2   4.3    59    3    male   not mino… english  tenu… not formal
##  7     7       2   2.8    59    3    male   not mino… english  tenu… not formal
##  8     8       3   4.1    51    3.33 male   not mino… english  tenu… not formal
##  9     9       3   3.4    51    3.33 male   not mino… english  tenu… not formal
## 10    10       4   4.5    40    3.17 female not mino… english  tenu… not formal
## # … with 453 more rows, and 4 more variables: pic_color <fct>,
## #   cls_did_eval <int>, cls_students <int>, cls_level <fct>
evals_ch6 <- evals %>% 
  select(score,bty_avg,age)
glimpse(evals_ch6)
## Rows: 463
## Columns: 3
## $ score   <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.5, 4…
## $ bty_avg <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000, 3.333, 3.333…
## $ age     <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 40, 4…
evals_ch6 %>% select(score, bty_avg) %>% 
  skim()
Data summary
Name Piped data
Number of rows 463
Number of columns 2
_______________________
Column type frequency:
numeric 2
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
score 0 1 4.17 0.54 2.30 3.80 4.30 4.6 5.00 ▁▁▅▇▇
bty_avg 0 1 4.42 1.53 1.67 3.17 4.33 5.5 8.17 ▃▇▇▃▂

5.1.1 Calculating correlations using moderndive

evals_ch6 %>% 
  get_correlation(score~bty_avg)
## # A tibble: 1 x 1
##     cor
##   <dbl>
## 1 0.187
gf_point(evals_ch6,score~bty_avg) %>% 
  # Note that gf_jitter would reveal the overplotted points
  gf_smooth(method = "lm")

Seems like a low-value positive correlation between the beauty score and the teaching score.

5.1.2 Linear Model

score_model <- lm(score ~ bty_avg, data = evals_ch6)

# Regression Table from `moderndive`
moderndive::get_regression_table(score_model)
## # A tibble: 2 x 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    3.88      0.076     51.0        0    3.73     4.03 
## 2 bty_avg      0.067     0.016      4.09       0    0.035    0.099

The model seems to make the correlation value even more tepid. (0.067 compared with 0.187)

5.1.3 LC6.1,LC6.2 Teaching Score vs Age

#LC6.1 - Exploration
gf_point(evals_ch6,score~age) %>% 
  gf_smooth(method = "lm")

evals_ch6 %>% get_correlation(score ~age)
## # A tibble: 1 x 1
##      cor
##    <dbl>
## 1 -0.107
#LC6.2 Model
model_age <- lm(score~age, evals_ch6)
get_regression_table(model_age)
## # A tibble: 2 x 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    4.46      0.127     35.2    0        4.21     4.71 
## 2 age         -0.006     0.003     -2.31   0.021   -0.011   -0.001
gf_point(evals_ch6,score ~ age) %>% 
  gf_smooth(method = "lm",se = TRUE)

The teaching score is also slightly negatively correlated withage.

5.1.4 Observed and Fitted values, Residuals

score_model
## 
## Call:
## lm(formula = score ~ bty_avg, data = evals_ch6)
## 
## Coefficients:
## (Intercept)      bty_avg  
##      3.8803       0.0666
reg_points <- get_regression_points(score_model)
reg_points
## # A tibble: 463 x 5
##       ID score bty_avg score_hat residual
##    <int> <dbl>   <dbl>     <dbl>    <dbl>
##  1     1   4.7    5         4.21    0.486
##  2     2   4.1    5         4.21   -0.114
##  3     3   3.9    5         4.21   -0.314
##  4     4   4.8    5         4.21    0.586
##  5     5   4.6    3         4.08    0.52 
##  6     6   4.3    3         4.08    0.22 
##  7     7   2.8    3         4.08   -1.28 
##  8     8   4.1    3.33      4.10   -0.002
##  9     9   3.4    3.33      4.10   -0.702
## 10    10   4.5    3.17      4.09    0.409
## # … with 453 more rows
# Trying to plot residuals: ggformula strangely does not work here
# It will work if we use the `fomrula` interface for all parameters, even numerically declared ones such as intercepts and slopes

gf_jitter(data = evals_ch6,score ~ bty_avg,height = 0) %>%
gf_abline(intercept = ~ 4.462, slope = ~ -0.006, color = "blue") %>%
  gf_segment(score_hat + score ~ bty_avg + bty_avg, data = reg_points,color = "red")

# Ha! Don't do the `tilde` thing for `ggplot`!!
ggplot(evals_ch6, aes(x = bty_avg, y = score))+
  geom_jitter(height = 0)+
  geom_abline(intercept = 4.462, slope = -0.006, color = "blue")+
  geom_segment(aes(y = score_hat,yend = score, x = bty_avg, xend = bty_avg), data = reg_points,color = "red")

5.2 One Categorical Explanatory variable

In this section, we’ll use the gapminder dataset to explore differences in life expectancy in two ways:

1. Differences between continents: Are there significant differences in life expectancy, on average, between the five continents of the world: Africa, the Americas, Asia, Europe, and Oceania?

2. Differences within continents: How does life expectancy vary within the world’s five continents? For example, is the spread of life expectancy among the countries of Africa larger than the spread of life expectancy among the countries of Asia? ( This is called verbatim text in RMarkdown, achieved with indentation of 4 spaces)

5.2.1 Gapminder

# Exploratory Data Analysis
library(gapminder)
gapminder2007 <- gapminder %>%  filter(year == "2007") %>% select(country, continent, lifeExp, gdpPercap)
glimpse(gapminder2007)
## Rows: 142
## Columns: 4
## $ country   <fct> Afghanistan, Albania, Algeria, Angola, Argentina, Australia…
## $ continent <fct> Asia, Europe, Africa, Africa, Americas, Oceania, Europe, As…
## $ lifeExp   <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79.829, 75.…
## $ gdpPercap <dbl> 974.58, 5937.03, 6223.37, 4797.23, 12779.38, 34435.37, 3612…
#Further Explorations 1
gapminder2007 %>% skim()
Data summary
Name Piped data
Number of rows 142
Number of columns 4
_______________________
Column type frequency:
factor 2
numeric 2
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
country 0 1 FALSE 142 Afg: 1, Alb: 1, Alg: 1, Ang: 1
continent 0 1 FALSE 5 Afr: 52, Asi: 33, Eur: 30, Ame: 25

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
lifeExp 0 1 67.01 12.07 39.61 57.16 71.94 76.41 82.6 ▂▃▃▆▇
gdpPercap 0 1 11680.07 12859.94 277.55 1624.84 6124.37 18008.84 49357.2 ▇▂▁▂▁
gf_histogram(data = gapminder2007,~lifeExp)

lifeExp is negatively skewed: there are many countries who have lower-than-mean Life Expectancy.

median(gapminder2007$lifeExp)
## [1] 71.935
gapminder2007 %>% group_by(continent) %>% 
  summarise(mean = mean(lifeExp), median = median(lifeExp), sd = sd(lifeExp), n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 5
##   continent  mean median    sd     n
##   <fct>     <dbl>  <dbl> <dbl> <int>
## 1 Africa     54.8   52.9 9.63     52
## 2 Americas   73.6   72.9 4.44     25
## 3 Asia       70.7   72.4 7.96     33
## 4 Europe     77.6   78.6 2.98     30
## 5 Oceania    80.7   80.7 0.729     2
#Visualise this
gapminder2007 %>% gf_histogram(~lifeExp|continent, ylab = "Number of Countries",xlab = "Life Expectancy in years")

gf_boxplot(data = gapminder2007, lifeExp~continent, title = " Life Expectancy per Continent compared to Worldwide") %>% 
  gf_hline(yintercept = ~ median(gapminder2007$lifeExp),color = "blue")
## Warning: Use of `gapminder2007$lifeExp` is discouraged. Use `lifeExp` instead.

gapminder2007 %>% filter(lifeExp<=50) # Hard life, guys!
## # A tibble: 19 x 4
##    country                  continent lifeExp gdpPercap
##    <fct>                    <fct>       <dbl>     <dbl>
##  1 Afghanistan              Asia         43.8      975.
##  2 Angola                   Africa       42.7     4797.
##  3 Burundi                  Africa       49.6      430.
##  4 Central African Republic Africa       44.7      706.
##  5 Congo, Dem. Rep.         Africa       46.5      278.
##  6 Cote d'Ivoire            Africa       48.3     1545.
##  7 Guinea-Bissau            Africa       46.4      579.
##  8 Lesotho                  Africa       42.6     1569.
##  9 Liberia                  Africa       45.7      415.
## 10 Malawi                   Africa       48.3      759.
## 11 Mozambique               Africa       42.1      824.
## 12 Nigeria                  Africa       46.9     2014.
## 13 Rwanda                   Africa       46.2      863.
## 14 Sierra Leone             Africa       42.6      863.
## 15 Somalia                  Africa       48.2      926.
## 16 South Africa             Africa       49.3     9270.
## 17 Swaziland                Africa       39.6     4513.
## 18 Zambia                   Africa       42.4     1271.
## 19 Zimbabwe                 Africa       43.5      470.

5.2.2 LC6.3 Gapminder gdpPercap vs continent

#Histogram
gf_histogram(~gdpPercap, data = gapminder2007)

#Boxplot
gf_boxplot(data = gapminder2007,gdpPercap~continent)

# fivenumbers for gdpPercap
gapminder2007 %>% group_by(continent) %>% 
  summarise(mean = mean(gdpPercap), median = median(gdpPercap),n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 4
##   continent   mean median     n
##   <fct>      <dbl>  <dbl> <int>
## 1 Africa     3089.  1452.    52
## 2 Americas  11003.  8948.    25
## 3 Asia      12473.  4471.    33
## 4 Europe    25054. 28054.    30
## 5 Oceania   29810. 29810.     2
gapminder2007 %>% filter(gdpPercap>=45000) # Rich guys!
## # A tibble: 3 x 4
##   country   continent lifeExp gdpPercap
##   <fct>     <fct>       <dbl>     <dbl>
## 1 Kuwait    Asia         77.6    47307.
## 2 Norway    Europe       80.2    49357.
## 3 Singapore Asia         80.0    47143.

5.2.3 Linear Regression with one categorical explanatory variable

lifeExp_model <- lm(lifeExp ~ continent, data = gapminder2007)
get_regression_table(lifeExp_model)
## # A tibble: 5 x 7
##   term              estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept             54.8      1.02     53.4        0     52.8     56.8
## 2 continentAmericas     18.8      1.8      10.4        0     15.2     22.4
## 3 continentAsia         15.9      1.65      9.68       0     12.7     19.2
## 4 continentEurope       22.8      1.70     13.5        0     19.5     26.2
## 5 continentOceania      25.9      5.33      4.86       0     15.4     36.4

The model equation becomes

\[ \widehat{lifeExp} = b_0 +b_{Amer}* 1_{Amer}(x) + b_{Asia} * 1_{Asia}(y)\\ +b_{Europe} * 1_{Europe}(z) + b_{Oceania} * 1_{Oceania}(p) \]

\[ where\\ \pmb{b_0} = intercept\\ \pmb{b_{Amer}} = Modelling \ Coefficient \ for \ American \ countries \ etc.\\ \pmb{1_{Amer}}(x) = Binary\ Categorical\ Variable\ to\\ indicate\ if\ a\ country\ is\ in\ the\ Americas\ etc.\ \]

Since Africa was alphabetically the first of the continents, it becomes the baseline and hence the intercept is mean(lifeExp)@Africa). We can change this using the forcats package, for example. The $1_{Asia}$ coefficients are binary, and take values 1 or 0 depending on whether the selected country is in that continent.

5.2.4 LC6.4 Gapminder2007 gdpPercap vs continent

# From LC6.3 above
#Histogram
gf_histogram(~gdpPercap, data = gapminder2007)

#Boxplot
gf_boxplot(data = gapminder2007,gdpPercap~continent)

# Model
gdp_model <- lm(gdpPercap ~ continent, data = gapminder2007)
get_regression_table(gdp_model)
## # A tibble: 5 x 7
##   term              estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept            3089.     1373.      2.25   0.026     375.    5804.
## 2 continentAmericas    7914.     2409.      3.28   0.001    3150.   12678.
## 3 continentAsia        9384.     2203.      4.26   0        5027.   13741.
## 4 continentEurope     21965.     2270.      9.68   0       17478.   26453.
## 5 continentOceania    26721.     7133.      3.75   0       12616.   40826.

6 Multiple Regression

library(ISLR)

We will use the Credit dataset from the book ISLR to demonstrate multiple regression with:

A numerical outcome variable y, in this case credit card `Balance`.
Two explanatory variables:
- A first numerical explanatory variable x1, in this case, their credit `Limit`.
- A second numerical explanatory variable x2, in this case, their `Income` (in thousands of dollars).

6.1 Exploratory Data Analysis

# Look at the data
ISLR::Credit %>% glimpse()
## Rows: 400
## Columns: 12
## $ ID        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ Income    <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 20.996, …
## $ Limit     <int> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300, 6819,…
## $ Rating    <int> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589, 138,…
## $ Cards     <int> 2, 3, 4, 3, 2, 4, 2, 2, 5, 3, 4, 3, 1, 1, 2, 3, 3, 3, 1, 2,…
## $ Age       <int> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49, 75,…
## $ Education <int> 11, 15, 11, 11, 16, 10, 12, 9, 13, 19, 14, 16, 7, 9, 13, 15…
## $ Gender    <fct>  Male, Female,  Male, Female,  Male,  Male, Female,  Male, …
## $ Student   <fct> No, Yes, No, No, No, No, No, No, No, Yes, No, No, No, No, N…
## $ Married   <fct> Yes, Yes, No, No, Yes, No, No, No, No, Yes, Yes, No, Yes, Y…
## $ Ethnicity <fct> Caucasian, Asian, Asian, Asian, Caucasian, Caucasian, Afric…
## $ Balance   <int> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 1407, 0…
Credit %>% 
  select(Balance, Limit, Rating, Age, Income) %>% 
  skim()
Data summary
Name Piped data
Number of rows 400
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Balance 0 1 520.02 459.76 0.00 68.75 459.50 863.00 1999.00 ▇▅▃▂▁
Limit 0 1 4735.60 2308.20 855.00 3088.00 4622.50 5872.75 13913.00 ▆▇▃▁▁
Rating 0 1 354.94 154.72 93.00 247.25 344.00 437.25 982.00 ▆▇▃▁▁
Age 0 1 55.67 17.25 23.00 41.75 56.00 70.00 98.00 ▆▇▇▇▁
Income 0 1 45.22 35.24 10.35 21.01 33.12 57.47 186.63 ▇▂▁▁▁
# Can do this simultaneously too
# `cor` takes two numerical vectors or takes a dataframe of numerical vectors for pairwise correlation
Credit %>% 
  select(Balance,Limit, Income) %>% 
  cor()
##         Balance   Limit  Income
## Balance 1.00000 0.86170 0.46366
## Limit   0.86170 1.00000 0.79209
## Income  0.46366 0.79209 1.00000
# Correlation Coefficients
get_correlation(Balance ~ Limit, data = Credit)
##      cor
## 1 0.8617
get_correlation(Balance ~ Income, data = Credit)
##       cor
## 1 0.46366
# Balance vs Income
gf_point(Balance ~ Income, data = Credit,title = "Balance is strongly correlated with Income") %>% 
  gf_smooth(method = "lm")

#Balance vs Limit
gf_point(Balance ~ Limit, data = Credit,title = "Balance is also strongly correlated with limit") %>% 
  gf_smooth(method = "lm")

The best fitting regression line with two explanatory variables will be a plane. We need to use plotly to plot the regression plane.

Note that Income and Limit, our explanatory variables, are also correlated, leading to the problem of multicollinearity.

6.1.1 Problem LC7.1 Balance vs Rating and Age

#EDA
# Look at the data
Credit %>% 
  select(Balance, Age, Rating) %>% 
  glimpse()
## Rows: 400
## Columns: 3
## $ Balance <int> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 1407, 0, …
## $ Age     <int> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49, 75, 5…
## $ Rating  <int> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589, 138, 3…
# Summary Statistics
Credit %>% 
  select(Balance, Age, Rating) %>% 
  skim()
Data summary
Name Piped data
Number of rows 400
Number of columns 3
_______________________
Column type frequency:
numeric 3
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Balance 0 1 520.02 459.76 0 68.75 459.5 863.00 1999 ▇▅▃▂▁
Age 0 1 55.67 17.25 23 41.75 56.0 70.00 98 ▆▇▇▇▁
Rating 0 1 354.94 154.72 93 247.25 344.0 437.25 982 ▆▇▃▁▁
Credit %>% 
  select(Balance, Age, Rating) %>% 
  cor()
##           Balance       Age  Rating
## Balance 1.0000000 0.0018351 0.86363
## Age     0.0018351 1.0000000 0.10316
## Rating  0.8636252 0.1031650 1.00000
#Visualisations
Credit %>% 
  gf_point(Balance~Rating, data = Credit)

Credit %>% 
  gf_point(Balance~Age, data = Credit)

We find that Balance is not correlated with Age (0.1032) but has a very strong correlation with `Rating’ (0.8636). So in this case a simple (linear) regression would suffice to model.

6.2 Modelling

Credit %>% 
  select(Balance, Limit, Income) %>% 
  cor()
##         Balance   Limit  Income
## Balance 1.00000 0.86170 0.46366
## Limit   0.86170 1.00000 0.79209
## Income  0.46366 0.79209 1.00000
#
Balance_model <- lm(Balance~Limit+Income, data = Credit)
get_regression_table(Balance_model)
## # A tibble: 3 x 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept -385.       19.5       -19.8       0 -423.    -347.   
## 2 Limit        0.264     0.006      45.0       0    0.253    0.276
## 3 Income      -7.66      0.385     -19.9       0   -8.42    -6.91

However, recall that when considered separately, both Limit and Income had positive relationships with the outcome variable Balance. As card holders’ credit limits increased their credit card balances tended to increase as well, and a similar relationship held for incomes and balances. In the above multiple regression, however, the slope for Income is now -7.66, suggesting a negative relationship between income and credit card balance. What explains these contradictory results?

This is known as Simpson’s Paradox, a phenomenon in which a trend appears in several different groups of data but disappears or reverses when these groups are combined. We expand on this in Subsection 7.3.2 where we’ll look at the relationship between credit Limit and credit card balance but split by different income bracket groups.

6.2.1 Problem LC7.2 Balance vs Rating and Age - Model

# Recall that:
Credit %>% 
  select(Balance, Age, Rating) %>% 
  cor()
##           Balance       Age  Rating
## Balance 1.0000000 0.0018351 0.86363
## Age     0.0018351 1.0000000 0.10316
## Rating  0.8636252 0.1031650 1.00000
#Modelling:
Balance_model2 <- lm(Balance ~ Rating + Age, data = Credit)
get_regression_table(Balance_model2)
## # A tibble: 3 x 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept  -270.      44.8       -6.02       0  -358.    -181.  
## 2 Rating        2.59     0.074     34.8        0     2.45     2.74
## 3 Age          -2.35     0.668     -3.52       0    -3.66    -1.04

We had noted that Age was poorly correlated ( 0.1032) with Balance. Here we see that in the model, Age is actually negatively correlated (-2.351) with Balance, while Rating is very strongly correlated with Balance. This is another example of ** Simpson’s Paradox** in action.

6.2.2 Observed and Fitted Values

# We performed:
# Balance_model <- lm(Balance ~ Limit + Income, data = Credit)

regression_points <- get_regression_points(Balance_model)
regression_points
## # A tibble: 400 x 6
##       ID Balance Limit Income Balance_hat residual
##    <int>   <int> <int>  <dbl>       <dbl>    <dbl>
##  1     1     333  3606   14.9        454.   -121. 
##  2     2     903  6645  106.         559.    344. 
##  3     3     580  7075  105.         683.   -103. 
##  4     4     964  9504  149.         986.    -21.7
##  5     5     331  4897   55.9        481.   -150. 
##  6     6    1151  8047   80.2       1127.     23.6
##  7     7     203  3388   21.0        349.   -146. 
##  8     8     872  7114   71.4        948.    -76.0
##  9     9     279  3300   15.1        371.    -92.2
## 10    10    1350  6819   71.1        873.    477. 
## # … with 390 more rows

6.3 One Numeric and one Categorical Explanatory variable

We use the Instructor eval data again.

\[ y = instructor\ score, numerical\ response\ variable \\ x_1 = Age, numerical\ explanatory\ variable\\ x_2 = Gender, categorical\ explanatory\ variable \]

evals_ch7 <- evals %>% select(score, gender, age)
evals_ch7 %>% skim()
Data summary
Name Piped data
Number of rows 463
Number of columns 3
_______________________
Column type frequency:
factor 1
numeric 2
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
gender 0 1 FALSE 2 mal: 268, fem: 195

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
score 0 1 4.17 0.54 2.3 3.8 4.3 4.6 5 ▁▁▅▇▇
age 0 1 48.37 9.80 29.0 42.0 48.0 57.0 73 ▅▆▇▆▁
evals_ch7 %>% glimpse()
## Rows: 463
## Columns: 3
## $ score  <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.5, 4.…
## $ gender <fct> female, female, female, female, male, male, male, male, male, …
## $ age    <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 40, 40…
evals_ch7 %>% get_correlation(score ~ age)
## # A tibble: 1 x 1
##      cor
##    <dbl>
## 1 -0.107
# evals_ch7 %>% get_correlation(score ~ gender) # does not work.

gf_jitter(score ~ age, data = evals_ch7,color = ~ gender) %>% 
  gf_smooth(method = "lm",title = "Graphs before Modelling")

6.3.1 Multiple Regression - Parallel Slopes Model

Putting a + between explanatory variables creates model with a single intercept, i.e. a parallel slopes model.

score_model2 <- lm(score ~ age + gender, data = evals_ch7)
get_regression_table(score_model2)
## # A tibble: 3 x 7
##   term       estimate std_error statistic p_value lower_ci upper_ci
##   <chr>         <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept     4.48      0.125     35.8    0        4.24     4.73 
## 2 age          -0.009     0.003     -3.28   0.001   -0.014   -0.003
## 3 gendermale    0.191     0.052      3.63   0        0.087    0.294

The model can be expressed as: \[ \widehat{score} = b_0+b_{age}*age+b_{is-male}*1_{is-male}\\ where\\ 1_{is-male}= \left\{\begin{matrix} =1\ person\ is\ male\\ =0\ person\ is\ female \end{matrix}\right. \]

$b_{is-male} = $ 0.19057.

6.3.2 Visualising the Model

gf_jitter(score ~ age, data = evals_ch7,color = ~gender,title = "Fitting a Parallel Slopes Model") %>% 
  gf_abline(intercept = ~ 4.484, slope = ~ -0.009,color = ~ gender,data = filter(evals_ch7,gender == "male")) %>% 
  gf_abline(intercept = ~ 4.484 + 0.191,slope = ~ -0.009,color = ~ gender,data = filter(evals_ch7,gender == "female"))

As before,
- females are treated as baseline, since thus category appears before males alphabetically. - the intercepts (which in this case make no sense since no instructor can have an age of 0) are :

– for women: \(b_0 = 4.484\)
– for men: \(b_0 +b_{male} = 4.484 + 0.191 = 4.675\)

Both men and women models have the same slope. In other words, in this model the associated effect size of age is the same for men and women. So for every increase of one year in age, there is on average an associated change of \(b_{age} = -0.009\) (a decrease) in teaching score.

The two plots (before and after model) show that there may be an interaction effect between the two explanatory variables. So we next model that.

6.4 Multiple Regression - Interaction Effects

We model interaction between independent variables using the * symbol between them in the lm formula.

score_model3 <- lm(score ~ age * gender, data = evals_ch7)
get_regression_table(score_model3)
## # A tibble: 4 x 7
##   term           estimate std_error statistic p_value lower_ci upper_ci
##   <chr>             <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept         4.88      0.205     23.8    0        4.48     5.29 
## 2 age              -0.018     0.004     -3.92   0       -0.026   -0.009
## 3 gendermale       -0.446     0.265     -1.68   0.094   -0.968    0.076
## 4 age:gendermale    0.014     0.006      2.45   0.015    0.003    0.024

This gives us a model as: $$ = b_0 + b_{age} * age + b_{male} * 1_{is-male}(x) +\ b_{age,male} * age * 1_{is-male}(x)\

= 4.883 -0.018 * age -0.446 * 1_{is-male}(x)\ + 0.014 * age * 1_{is-male}(x) $$

Hence the two models for the two genders become

\[ \widehat{score_{male}}\\ = 4.883 -0.018*age - 0.446 +0.014 * age \\= 4.437 - 0.004 * age \]

\[ \widehat{score_{female}} = 4.883 - 0.018*age \]

So if we visualize this we get:

gf_point(score ~ age,data = evals_ch7,color = ~gender) %>%   
  gf_abline(intercept = ~ 4.437, slope = ~ -0.004, color = ~ gender,data = filter(evals_ch7,gender == "male")) %>% 
  gf_abline(intercept = ~ 4.883,slope = ~ -0.018, color = ~ gender, data = filter(evals_ch7,gender == "female"))

It can be seen that the effect of age on the teaching scores of female teachers is more harsh, compared to that of male teachers. Who said it was a fair world?

We can also obtain the observed and fitted values and residuals:

regression_points <- get_regression_points(score_model3)
regression_points
## # A tibble: 463 x 6
##       ID score   age gender score_hat residual
##    <int> <dbl> <int> <fct>      <dbl>    <dbl>
##  1     1   4.7    36 female      4.25    0.448
##  2     2   4.1    36 female      4.25   -0.152
##  3     3   3.9    36 female      4.25   -0.352
##  4     4   4.8    36 female      4.25    0.548
##  5     5   4.6    59 male        4.20    0.399
##  6     6   4.3    59 male        4.20    0.099
##  7     7   2.8    59 male        4.20   -1.40 
##  8     8   4.1    51 male        4.23   -0.133
##  9     9   3.4    51 male        4.23   -0.833
## 10    10   4.5    40 female      4.18    0.318
## # … with 453 more rows

We can plot the residuals too:

gf_jitter(score ~ age, data = regression_points, color = ~ gender) %>%
  gf_point(score_hat ~ age, data = regression_points, color = ~ gender) %>%
  gf_segment(score_hat + score ~ age + age, data = regression_points %>% filter(gender == "female")) %>%
  gf_segment(score_hat + score ~ age + age, data = regression_points %>% filter(gender == "male")) %>%
  gf_abline(
    intercept = ~ 4.437,
    slope = ~ -0.004,
    color = "blue",
    data = regression_points
  ) %>%
  gf_abline(
    intercept = ~ 4.883,
    slope = ~ -0.018,
    color = "magenta",
    data = regression_points
  )

6.5 Simpson’s Paradox

We saw the two following seemingly contradictory results when studying the relationship between credit card balance, credit limit, and income. On the one hand, the right hand plot of Figure 7.1 suggested that credit card balance and income were positively related:

gf_point(Balance ~ Limit, data = Credit) %>% 
  gf_smooth(method = "lm")
gf_point(Balance~ Income, data = Credit)%>% 
  gf_smooth(method = "lm")

# However
lm(Balance ~ Limit + Income, data = Credit) %>% 
  get_regression_table()
## # A tibble: 3 x 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept -385.       19.5       -19.8       0 -423.    -347.   
## 2 Limit        0.264     0.006      45.0       0    0.253    0.276
## 3 Income      -7.66      0.385     -19.9       0   -8.42    -6.91

> On the other hand, the multiple regression in Table 7.3, suggested that when modeling credit card Balance as: \(Balance = f(Limit,Income)\) credit limit has a negative relationship with balance, as evidenced by the slope of -7.66. How can this be?

If we create the histogram of `Limit’:

gf_histogram(~ Limit,binwidth = 200,data = Credit,color = "white") %>% 
  gf_vline(xintercept = ~ quantile(x = Credit$Limit,probs = 0.25), color = "red", linetype = "dashed") %>% 
  gf_vline(xintercept = ~ quantile(x = Credit$Limit,probs = 0.5), color = "red", linetype = "dashed") %>% 
  gf_vline(xintercept = ~ quantile(x = Credit$Limit,probs = 0.75), color = "red", linetype = "dashed")
## Warning: Use of `Credit$Limit` is discouraged. Use `Limit` instead.

## Warning: Use of `Credit$Limit` is discouraged. Use `Limit` instead.

## Warning: Use of `Credit$Limit` is discouraged. Use `Limit` instead.

then we can split the income range into low, medium-low, medium-high and high brackets.

We can then split the individuals into groups as per the above and then see if they, in their groups, have a positive correlation with Credit Card Balance.

Limit_quantiles <-  with(Credit, quantile(Limit, probs = c(0, 0.25, 0.5, 0.75,1)))
Limit_quantiles
##      0%     25%     50%     75%    100% 
##   855.0  3088.0  4622.5  5872.8 13913.0
Credit <-
  Credit %>% mutate(Type = cut(
    Limit,
    breaks = Limit_quantiles,
    labels = c("Low", "Medium_low", "Medium-high", "High"),
    right = FALSE,
    ordered_result = TRUE
  ))
Credit
##      ID  Income Limit Rating Cards Age Education Gender Student Married
## 1     1  14.891  3606    283     2  34        11   Male      No     Yes
## 2     2 106.025  6645    483     3  82        15 Female     Yes     Yes
## 3     3 104.593  7075    514     4  71        11   Male      No      No
## 4     4 148.924  9504    681     3  36        11 Female      No      No
## 5     5  55.882  4897    357     2  68        16   Male      No     Yes
## 6     6  80.180  8047    569     4  77        10   Male      No      No
## 7     7  20.996  3388    259     2  37        12 Female      No      No
## 8     8  71.408  7114    512     2  87         9   Male      No      No
## 9     9  15.125  3300    266     5  66        13 Female      No      No
## 10   10  71.061  6819    491     3  41        19 Female     Yes     Yes
## 11   11  63.095  8117    589     4  30        14   Male      No     Yes
## 12   12  15.045  1311    138     3  64        16   Male      No      No
## 13   13  80.616  5308    394     1  57         7 Female      No     Yes
## 14   14  43.682  6922    511     1  49         9   Male      No     Yes
## 15   15  19.144  3291    269     2  75        13 Female      No      No
## 16   16  20.089  2525    200     3  57        15 Female      No     Yes
## 17   17  53.598  3714    286     3  73        17 Female      No     Yes
## 18   18  36.496  4378    339     3  69        15 Female      No     Yes
## 19   19  49.570  6384    448     1  28         9 Female      No     Yes
## 20   20  42.079  6626    479     2  44         9   Male      No      No
## 21   21  17.700  2860    235     4  63        16 Female      No      No
## 22   22  37.348  6378    458     1  72        17 Female      No      No
## 23   23  20.103  2631    213     3  61        10   Male      No     Yes
## 24   24  64.027  5179    398     5  48         8   Male      No     Yes
## 25   25  10.742  1757    156     3  57        15 Female      No      No
## 26   26  14.090  4323    326     5  25        16 Female      No     Yes
## 27   27  42.471  3625    289     6  44        12 Female     Yes      No
## 28   28  32.793  4534    333     2  44        16   Male      No      No
## 29   29 186.634 13414    949     2  41        14 Female      No     Yes
## 30   30  26.813  5611    411     4  55        16 Female      No      No
## 31   31  34.142  5666    413     4  47         5 Female      No     Yes
## 32   32  28.941  2733    210     5  43        16   Male      No     Yes
## 33   33 134.181  7838    563     2  48        13 Female      No      No
## 34   34  31.367  1829    162     4  30        10   Male      No     Yes
## 35   35  20.150  2646    199     2  25        14 Female      No     Yes
## 36   36  23.350  2558    220     3  49        12 Female     Yes      No
## 37   37  62.413  6457    455     2  71        11 Female      No     Yes
## 38   38  30.007  6481    462     2  69         9 Female      No     Yes
## 39   39  11.795  3899    300     4  25        10 Female      No      No
## 40   40  13.647  3461    264     4  47        14   Male      No     Yes
## 41   41  34.950  3327    253     3  54        14 Female      No      No
## 42   42 113.659  7659    538     2  66        15   Male     Yes     Yes
## 43   43  44.158  4763    351     2  66        13 Female      No     Yes
## 44   44  36.929  6257    445     1  24        14 Female      No     Yes
## 45   45  31.861  6375    469     3  25        16 Female      No     Yes
## 46   46  77.380  7569    564     3  50        12 Female      No     Yes
## 47   47  19.531  5043    376     2  64        16 Female     Yes     Yes
## 48   48  44.646  4431    320     2  49        15   Male     Yes     Yes
## 49   49  44.522  2252    205     6  72        15   Male      No     Yes
## 50   50  43.479  4569    354     4  49        13   Male     Yes     Yes
## 51   51  36.362  5183    376     3  49        15   Male      No     Yes
## 52   52  39.705  3969    301     2  27        20   Male      No     Yes
## 53   53  44.205  5441    394     1  32        12   Male      No     Yes
## 54   54  16.304  5466    413     4  66        10   Male      No     Yes
## 55   55  15.333  1499    138     2  47         9 Female      No     Yes
## 56   56  32.916  1786    154     2  60         8 Female      No     Yes
## 57   57  57.100  4742    372     7  79        18 Female      No     Yes
## 58   58  76.273  4779    367     4  65        14 Female      No     Yes
## 59   59  10.354  3480    281     2  70        17   Male      No     Yes
## 60   60  51.872  5294    390     4  81        17 Female      No      No
## 61   61  35.510  5198    364     2  35        20 Female      No      No
## 62   62  21.238  3089    254     3  59        10 Female      No      No
## 63   63  30.682  1671    160     2  77         7 Female      No      No
## 64   64  14.132  2998    251     4  75        17   Male      No      No
## 65   65  32.164  2937    223     2  79        15 Female      No     Yes
## 66   66  12.000  4160    320     4  28        14 Female      No     Yes
## 67   67 113.829  9704    694     4  38        13 Female      No     Yes
## 68   68  11.187  5099    380     4  69        16 Female      No      No
## 69   69  27.847  5619    418     2  78        15 Female      No     Yes
## 70   70  49.502  6819    505     4  55        14   Male      No     Yes
## 71   71  24.889  3954    318     4  75        12   Male      No     Yes
## 72   72  58.781  7402    538     2  81        12 Female      No     Yes
## 73   73  22.939  4923    355     1  47        18 Female      No     Yes
## 74   74  23.989  4523    338     4  31        15   Male      No      No
## 75   75  16.103  5390    418     4  45        10 Female      No     Yes
## 76   76  33.017  3180    224     2  28        16   Male      No     Yes
## 77   77  30.622  3293    251     1  68        16   Male     Yes      No
## 78   78  20.936  3254    253     1  30        15 Female      No      No
## 79   79 110.968  6662    468     3  45        11 Female      No     Yes
## 80   80  15.354  2101    171     2  65        14   Male      No      No
## 81   81  27.369  3449    288     3  40         9 Female      No     Yes
## 82   82  53.480  4263    317     1  83        15   Male      No      No
## 83   83  23.672  4433    344     3  63        11   Male      No      No
## 84   84  19.225  1433    122     3  38        14 Female      No      No
## 85   85  43.540  2906    232     4  69        11   Male      No      No
## 86   86 152.298 12066    828     4  41        12 Female      No     Yes
## 87   87  55.367  6340    448     1  33        15   Male      No     Yes
## 88   88  11.741  2271    182     4  59        12 Female      No      No
## 89   89  15.560  4307    352     4  57         8   Male      No     Yes
## 90   90  59.530  7518    543     3  52         9 Female      No      No
## 91   91  20.191  5767    431     4  42        16   Male      No     Yes
## 92   92  48.498  6040    456     3  47        16   Male      No     Yes
## 93   93  30.733  2832    249     4  51        13   Male      No      No
## 94   94  16.479  5435    388     2  26        16   Male      No      No
## 95   95  38.009  3075    245     3  45        15 Female      No      No
## 96   96  14.084   855    120     5  46        17 Female      No     Yes
## 97   97  14.312  5382    367     1  59        17   Male     Yes      No
## 98   98  26.067  3388    266     4  74        17 Female      No     Yes
## 99   99  36.295  2963    241     2  68        14 Female     Yes      No
## 100 100  83.851  8494    607     5  47        18   Male      No      No
## 101 101  21.153  3736    256     1  41        11   Male      No      No
## 102 102  17.976  2433    190     3  70        16 Female     Yes      No
## 103 103  68.713  7582    531     2  56        16   Male     Yes      No
## 104 104 146.183  9540    682     6  66        15   Male      No      No
## 105 105  15.846  4768    365     4  53        12 Female      No      No
## 106 106  12.031  3182    259     2  58        18 Female      No     Yes
## 107 107  16.819  1337    115     2  74        15   Male      No     Yes
## 108 108  39.110  3189    263     3  72        12   Male      No      No
## 109 109 107.986  6033    449     4  64        14   Male      No     Yes
## 110 110  13.561  3261    279     5  37        19   Male      No     Yes
## 111 111  34.537  3271    250     3  57        17 Female      No     Yes
## 112 112  28.575  2959    231     2  60        11 Female      No      No
## 113 113  46.007  6637    491     4  42        14   Male      No     Yes
## 114 114  69.251  6386    474     4  30        12 Female      No     Yes
## 115 115  16.482  3326    268     4  41        15   Male      No      No
## 116 116  40.442  4828    369     5  81         8 Female      No      No
## 117 117  35.177  2117    186     3  62        16 Female      No      No
## 118 118  91.362  9113    626     1  47        17   Male      No     Yes
## 119 119  27.039  2161    173     3  40        17 Female      No      No
## 120 120  23.012  1410    137     3  81        16   Male      No      No
## 121 121  27.241  1402    128     2  67        15 Female      No     Yes
## 122 122 148.080  8157    599     2  83        13   Male      No     Yes
## 123 123  62.602  7056    481     1  84        11 Female      No      No
## 124 124  11.808  1300    117     3  77        14 Female      No      No
## 125 125  29.564  2529    192     1  30        12 Female      No     Yes
## 126 126  27.578  2531    195     1  34        15 Female      No     Yes
## 127 127  26.427  5533    433     5  50        15 Female     Yes     Yes
## 128 128  57.202  3411    259     3  72        11 Female      No      No
## 129 129 123.299  8376    610     2  89        17   Male     Yes      No
## 130 130  18.145  3461    279     3  56        15   Male      No     Yes
## 131 131  23.793  3821    281     4  56        12 Female     Yes     Yes
## 132 132  10.726  1568    162     5  46        19   Male      No     Yes
## 133 133  23.283  5443    407     4  49        13   Male      No     Yes
## 134 134  21.455  5829    427     4  80        12 Female      No     Yes
## 135 135  34.664  5835    452     3  77        15 Female      No     Yes
## 136 136  44.473  3500    257     3  81        16 Female      No      No
## 137 137  54.663  4116    314     2  70         8 Female      No      No
## 138 138  36.355  3613    278     4  35         9   Male      No     Yes
## 139 139  21.374  2073    175     2  74        11 Female      No     Yes
## 140 140 107.841 10384    728     3  87         7   Male      No      No
## 141 141  39.831  6045    459     3  32        12 Female     Yes     Yes
## 142 142  91.876  6754    483     2  33        10   Male      No     Yes
## 143 143 103.893  7416    549     3  84        17   Male      No      No
## 144 144  19.636  4896    387     3  64        10 Female      No      No
## 145 145  17.392  2748    228     3  32        14   Male      No     Yes
## 146 146  19.529  4673    341     2  51        14   Male      No      No
## 147 147  17.055  5110    371     3  55        15 Female      No     Yes
## 148 148  23.857  1501    150     3  56        16   Male      No     Yes
## 149 149  15.184  2420    192     2  69        11 Female      No     Yes
## 150 150  13.444   886    121     5  44        10   Male      No     Yes
## 151 151  63.931  5728    435     3  28        14 Female      No     Yes
## 152 152  35.864  4831    353     3  66        13 Female      No     Yes
## 153 153  41.419  2120    184     4  24        11 Female     Yes      No
## 154 154  92.112  4612    344     3  32        17   Male      No      No
## 155 155  55.056  3155    235     2  31        16   Male      No     Yes
## 156 156  19.537  1362    143     4  34         9 Female      No     Yes
## 157 157  31.811  4284    338     5  75        13 Female      No     Yes
## 158 158  56.256  5521    406     2  72        16 Female     Yes     Yes
## 159 159  42.357  5550    406     2  83        12 Female      No     Yes
## 160 160  53.319  3000    235     3  53        13   Male      No      No
## 161 161  12.238  4865    381     5  67        11 Female      No      No
## 162 162  31.353  1705    160     3  81        14   Male      No     Yes
## 163 163  63.809  7530    515     1  56        12   Male      No     Yes
## 164 164  13.676  2330    203     5  80        16 Female      No      No
## 165 165  76.782  5977    429     4  44        12   Male      No     Yes
## 166 166  25.383  4527    367     4  46        11   Male      No     Yes
## 167 167  35.691  2880    214     2  35        15   Male      No      No
## 168 168  29.403  2327    178     1  37        14 Female      No     Yes
## 169 169  27.470  2820    219     1  32        11 Female      No     Yes
## 170 170  27.330  6179    459     4  36        12 Female      No     Yes
## 171 171  34.772  2021    167     3  57         9   Male      No      No
## 172 172  36.934  4270    299     1  63         9 Female      No     Yes
## 173 173  76.348  4697    344     4  60        18   Male      No      No
## 174 174  14.887  4745    339     3  58        12   Male      No     Yes
## 175 175 121.834 10673    750     3  54        16   Male      No      No
## 176 176  30.132  2168    206     3  52        17   Male      No      No
## 177 177  24.050  2607    221     4  32        18   Male      No     Yes
## 178 178  22.379  3965    292     2  34        14 Female      No     Yes
## 179 179  28.316  4391    316     2  29        10 Female      No      No
## 180 180  58.026  7499    560     5  67        11 Female      No      No
## 181 181  10.635  3584    294     5  69        16   Male      No     Yes
## 182 182  46.102  5180    382     3  81        12   Male      No     Yes
## 183 183  58.929  6420    459     2  66         9 Female      No     Yes
## 184 184  80.861  4090    335     3  29        15 Female      No     Yes
## 185 185 158.889 11589    805     1  62        17 Female      No     Yes
## 186 186  30.420  4442    316     1  30        14 Female      No      No
## 187 187  36.472  3806    309     2  52        13   Male      No      No
## 188 188  23.365  2179    167     2  75        15   Male      No      No
## 189 189  83.869  7667    554     2  83        11   Male      No      No
## 190 190  58.351  4411    326     2  85        16 Female      No     Yes
## 191 191  55.187  5352    385     4  50        17 Female      No     Yes
## 192 192 124.290  9560    701     3  52        17 Female     Yes      No
## 193 193  28.508  3933    287     4  56        14   Male      No     Yes
## 194 194 130.209 10088    730     7  39        19 Female      No     Yes
## 195 195  30.406  2120    181     2  79        14   Male      No     Yes
## 196 196  23.883  5384    398     2  73        16 Female      No     Yes
## 197 197  93.039  7398    517     1  67        12   Male      No     Yes
## 198 198  50.699  3977    304     2  84        17 Female      No      No
## 199 199  27.349  2000    169     4  51        16 Female      No     Yes
## 200 200  10.403  4159    310     3  43         7   Male      No     Yes
## 201 201  23.949  5343    383     2  40        18   Male      No     Yes
## 202 202  73.914  7333    529     6  67        15 Female      No     Yes
## 203 203  21.038  1448    145     2  58        13 Female      No     Yes
## 204 204  68.206  6784    499     5  40        16 Female     Yes      No
## 205 205  57.337  5310    392     2  45         7 Female      No      No
## 206 206  10.793  3878    321     8  29        13   Male      No      No
## 207 207  23.450  2450    180     2  78        13   Male      No      No
## 208 208  10.842  4391    358     5  37        10 Female     Yes     Yes
## 209 209  51.345  4327    320     3  46        15   Male      No      No
## 210 210 151.947  9156    642     2  91        11 Female      No     Yes
## 211 211  24.543  3206    243     2  62        12 Female      No     Yes
## 212 212  29.567  5309    397     3  25        15   Male      No      No
## 213 213  39.145  4351    323     2  66        13   Male      No     Yes
## 214 214  39.422  5245    383     2  44        19   Male      No      No
## 215 215  34.909  5289    410     2  62        16 Female      No     Yes
## 216 216  41.025  4229    337     3  79        19 Female      No     Yes
## 217 217  15.476  2762    215     3  60        18   Male      No      No
## 218 218  12.456  5395    392     3  65        14   Male      No     Yes
## 219 219  10.627  1647    149     2  71        10 Female     Yes     Yes
## 220 220  38.954  5222    370     4  76        13 Female      No      No
## 221 221  44.847  5765    437     3  53        13 Female     Yes      No
## 222 222  98.515  8760    633     5  78        11 Female      No      No
## 223 223  33.437  6207    451     4  44         9   Male     Yes      No
## 224 224  27.512  4613    344     5  72        17   Male      No     Yes
## 225 225 121.709  7818    584     4  50         6   Male      No     Yes
## 226 226  15.079  5673    411     4  28        15 Female      No     Yes
## 227 227  59.879  6906    527     6  78        15 Female      No      No
## 228 228  66.989  5614    430     3  47        14 Female      No     Yes
## 229 229  69.165  4668    341     2  34        11 Female      No      No
## 230 230  69.943  7555    547     3  76         9   Male      No     Yes
## 231 231  33.214  5137    387     3  59         9   Male      No      No
## 232 232  25.124  4776    378     4  29        12   Male      No     Yes
## 233 233  15.741  4788    360     1  39        14   Male      No     Yes
## 234 234  11.603  2278    187     3  71        11   Male      No     Yes
## 235 235  69.656  8244    579     3  41        14   Male      No     Yes
## 236 236  10.503  2923    232     3  25        18 Female      No     Yes
## 237 237  42.529  4986    369     2  37        11   Male      No     Yes
## 238 238  60.579  5149    388     5  38        15   Male      No     Yes
## 239 239  26.532  2910    236     6  58        19 Female      No     Yes
## 240 240  27.952  3557    263     1  35        13 Female      No     Yes
## 241 241  29.705  3351    262     5  71        14 Female      No     Yes
## 242 242  15.602   906    103     2  36        11   Male      No     Yes
## 243 243  20.918  1233    128     3  47        18 Female     Yes     Yes
## 244 244  58.165  6617    460     1  56        12 Female      No     Yes
## 245 245  22.561  1787    147     4  66        15 Female      No      No
## 246 246  34.509  2001    189     5  80        18 Female      No     Yes
## 247 247  19.588  3211    265     4  59        14 Female      No      No
## 248 248  36.364  2220    188     3  50        19   Male      No      No
## 249 249  15.717   905     93     1  38        16   Male     Yes     Yes
## 250 250  22.574  1551    134     3  43        13 Female     Yes     Yes
## 251 251  10.363  2430    191     2  47        18 Female      No     Yes
## 252 252  28.474  3202    267     5  66        12   Male      No     Yes
## 253 253  72.945  8603    621     3  64         8 Female      No      No
## 254 254  85.425  5182    402     6  60        12   Male      No     Yes
## 255 255  36.508  6386    469     4  79         6 Female      No     Yes
## 256 256  58.063  4221    304     3  50         8   Male      No      No
## 257 257  25.936  1774    135     2  71        14 Female      No      No
## 258 258  15.629  2493    186     1  60        14   Male      No     Yes
## 259 259  41.400  2561    215     2  36        14   Male      No     Yes
## 260 260  33.657  6196    450     6  55         9 Female      No      No
## 261 261  67.937  5184    383     4  63        12   Male      No     Yes
## 262 262 180.379  9310    665     3  67         8 Female     Yes     Yes
## 263 263  10.588  4049    296     1  66        13 Female      No     Yes
## 264 264  29.725  3536    270     2  52        15 Female      No      No
## 265 265  27.999  5107    380     1  55        10   Male      No     Yes
## 266 266  40.885  5013    379     3  46        13 Female      No     Yes
## 267 267  88.830  4952    360     4  86        16 Female      No     Yes
## 268 268  29.638  5833    433     3  29        15 Female      No     Yes
## 269 269  25.988  1349    142     4  82        12   Male      No      No
## 270 270  39.055  5565    410     4  48        18 Female      No     Yes
## 271 271  15.866  3085    217     1  39        13   Male      No      No
## 272 272  44.978  4866    347     1  30        10 Female      No      No
## 273 273  30.413  3690    299     2  25        15 Female     Yes      No
## 274 274  16.751  4706    353     6  48        14   Male     Yes      No
## 275 275  30.550  5869    439     5  81         9 Female      No      No
## 276 276 163.329  8732    636     3  50        14   Male      No     Yes
## 277 277  23.106  3476    257     2  50        15 Female      No      No
## 278 278  41.532  5000    353     2  50        12   Male      No     Yes
## 279 279 128.040  6982    518     2  78        11 Female      No     Yes
## 280 280  54.319  3063    248     3  59         8 Female     Yes      No
## 281 281  53.401  5319    377     3  35        12 Female      No      No
## 282 282  36.142  1852    183     3  33        13 Female      No      No
## 283 283  63.534  8100    581     2  50        17 Female      No     Yes
## 284 284  49.927  6396    485     3  75        17 Female      No     Yes
## 285 285  14.711  2047    167     2  67         6   Male      No     Yes
## 286 286  18.967  1626    156     2  41        11 Female      No     Yes
## 287 287  18.036  1552    142     2  48        15 Female      No      No
## 288 288  60.449  3098    272     4  69         8   Male      No     Yes
## 289 289  16.711  5274    387     3  42        16 Female      No     Yes
## 290 290  10.852  3907    296     2  30         9   Male      No      No
## 291 291  26.370  3235    268     5  78        11   Male      No     Yes
## 292 292  24.088  3665    287     4  56        13 Female      No     Yes
## 293 293  51.532  5096    380     2  31        15   Male      No     Yes
## 294 294 140.672 11200    817     7  46         9   Male      No     Yes
## 295 295  42.915  2532    205     4  42        13   Male      No     Yes
## 296 296  27.272  1389    149     5  67        10 Female      No     Yes
## 297 297  65.896  5140    370     1  49        17 Female      No     Yes
## 298 298  55.054  4381    321     3  74        17   Male      No     Yes
## 299 299  20.791  2672    204     1  70        18 Female      No      No
## 300 300  24.919  5051    372     3  76        11 Female      No     Yes
## 301 301  21.786  4632    355     1  50        17   Male      No     Yes
## 302 302  31.335  3526    289     3  38         7 Female      No      No
## 303 303  59.855  4964    365     1  46        13 Female      No     Yes
## 304 304  44.061  4970    352     1  79        11   Male      No     Yes
## 305 305  82.706  7506    536     2  64        13 Female      No     Yes
## 306 306  24.460  1924    165     2  50        14 Female      No     Yes
## 307 307  45.120  3762    287     3  80         8   Male      No     Yes
## 308 308  75.406  3874    298     3  41        14 Female      No     Yes
## 309 309  14.956  4640    332     2  33         6   Male      No      No
## 310 310  75.257  7010    494     3  34        18 Female      No     Yes
## 311 311  33.694  4891    369     1  52        16   Male     Yes      No
## 312 312  23.375  5429    396     3  57        15 Female      No     Yes
## 313 313  27.825  5227    386     6  63        11   Male      No     Yes
## 314 314  92.386  7685    534     2  75        18 Female      No     Yes
## 315 315 115.520  9272    656     2  69        14   Male      No      No
## 316 316  14.479  3907    296     3  43        16   Male      No     Yes
## 317 317  52.179  7306    522     2  57        14   Male      No      No
## 318 318  68.462  4712    340     2  71        16   Male      No     Yes
## 319 319  18.951  1485    129     3  82        13 Female      No      No
## 320 320  27.590  2586    229     5  54        16   Male      No     Yes
## 321 321  16.279  1160    126     3  78        13   Male     Yes     Yes
## 322 322  25.078  3096    236     2  27        15 Female      No     Yes
## 323 323  27.229  3484    282     6  51        11   Male      No      No
## 324 324 182.728 13913    982     4  98        17   Male      No     Yes
## 325 325  31.029  2863    223     2  66        17   Male     Yes     Yes
## 326 326  17.765  5072    364     1  66        12 Female      No     Yes
## 327 327 125.480 10230    721     3  82        16   Male      No     Yes
## 328 328  49.166  6662    508     3  68        14 Female      No      No
## 329 329  41.192  3673    297     3  54        16 Female      No     Yes
## 330 330  94.193  7576    527     2  44        16 Female      No     Yes
## 331 331  20.405  4543    329     2  72        17   Male     Yes      No
## 332 332  12.581  3976    291     2  48        16   Male      No     Yes
## 333 333  62.328  5228    377     3  83        15   Male      No      No
## 334 334  21.011  3402    261     2  68        17   Male      No     Yes
## 335 335  24.230  4756    351     2  64        15 Female      No     Yes
## 336 336  24.314  3409    270     2  23         7 Female      No     Yes
## 337 337  32.856  5884    438     4  68        13   Male      No      No
## 338 338  12.414   855    119     3  32        12   Male      No     Yes
## 339 339  41.365  5303    377     1  45        14   Male      No      No
## 340 340 149.316 10278    707     1  80        16   Male      No      No
## 341 341  27.794  3807    301     4  35         8 Female      No     Yes
## 342 342  13.234  3922    299     2  77        17 Female      No     Yes
## 343 343  14.595  2955    260     5  37         9   Male      No     Yes
## 344 344  10.735  3746    280     2  44        17 Female      No     Yes
## 345 345  48.218  5199    401     7  39        10   Male      No     Yes
## 346 346  30.012  1511    137     2  33        17   Male      No     Yes
## 347 347  21.551  5380    420     5  51        18   Male      No     Yes
## 348 348 160.231 10748    754     2  69        17   Male      No      No
## 349 349  13.433  1134    112     3  70        14   Male      No     Yes
## 350 350  48.577  5145    389     3  71        13 Female      No     Yes
## 351 351  30.002  1561    155     4  70        13 Female      No     Yes
## 352 352  61.620  5140    374     1  71         9   Male      No     Yes
## 353 353 104.483  7140    507     2  41        14   Male      No     Yes
## 354 354  41.868  4716    342     2  47        18   Male      No      No
## 355 355  12.068  3873    292     1  44        18 Female      No     Yes
## 356 356 180.682 11966    832     2  58         8 Female      No     Yes
## 357 357  34.480  6090    442     3  36        14   Male      No      No
## 358 358  39.609  2539    188     1  40        14   Male      No     Yes
## 359 359  30.111  4336    339     1  81        18   Male      No     Yes
## 360 360  12.335  4471    344     3  79        12   Male      No     Yes
## 361 361  53.566  5891    434     4  82        10 Female      No      No
## 362 362  53.217  4943    362     2  46        16 Female      No     Yes
## 363 363  26.162  5101    382     3  62        19 Female      No      No
## 364 364  64.173  6127    433     1  80        10   Male      No     Yes
## 365 365 128.669  9824    685     3  67        16   Male      No     Yes
## 366 366 113.772  6442    489     4  69        15   Male     Yes     Yes
## 367 367  61.069  7871    564     3  56        14   Male      No     Yes
## 368 368  23.793  3615    263     2  70        14   Male      No      No
## 369 369  89.000  5759    440     3  37         6 Female      No      No
## 370 370  71.682  8028    599     3  57        16   Male      No     Yes
## 371 371  35.610  6135    466     4  40        12   Male      No      No
## 372 372  39.116  2150    173     4  75        15   Male      No      No
## 373 373  19.782  3782    293     2  46        16 Female     Yes      No
## 374 374  55.412  5354    383     2  37        16 Female     Yes     Yes
## 375 375  29.400  4840    368     3  76        18 Female      No     Yes
## 376 376  20.974  5673    413     5  44        16 Female      No     Yes
## 377 377  87.625  7167    515     2  46        10 Female      No      No
## 378 378  28.144  1567    142     3  51        10   Male      No     Yes
## 379 379  19.349  4941    366     1  33        19   Male      No     Yes
## 380 380  53.308  2860    214     1  84        10   Male      No     Yes
## 381 381 115.123  7760    538     3  83        14 Female      No      No
## 382 382 101.788  8029    574     2  84        11   Male      No     Yes
## 383 383  24.824  5495    409     1  33         9   Male     Yes      No
## 384 384  14.292  3274    282     9  64         9   Male      No     Yes
## 385 385  20.088  1870    180     3  76        16   Male      No      No
## 386 386  26.400  5640    398     3  58        15 Female      No      No
## 387 387  19.253  3683    287     4  57        10   Male      No      No
## 388 388  16.529  1357    126     3  62         9   Male      No      No
## 389 389  37.878  6827    482     2  80        13 Female      No      No
## 390 390  83.948  7100    503     2  44        18   Male      No      No
## 391 391 135.118 10578    747     3  81        15 Female      No     Yes
## 392 392  73.327  6555    472     2  43        15 Female      No      No
## 393 393  25.974  2308    196     2  24        10   Male      No      No
## 394 394  17.316  1335    138     2  65        13   Male      No      No
## 395 395  49.794  5758    410     4  40         8   Male      No      No
## 396 396  12.096  4100    307     3  32        13   Male      No     Yes
## 397 397  13.364  3838    296     5  65        17   Male      No      No
## 398 398  57.872  4171    321     5  67        12 Female      No     Yes
## 399 399  37.728  2525    192     1  44        13   Male      No     Yes
## 400 400  18.701  5524    415     5  64         7 Female      No      No
##            Ethnicity Balance        Type
## 1          Caucasian     333  Medium_low
## 2              Asian     903        High
## 3              Asian     580        High
## 4              Asian     964        High
## 5          Caucasian     331 Medium-high
## 6          Caucasian    1151        High
## 7   African American     203  Medium_low
## 8              Asian     872        High
## 9          Caucasian     279  Medium_low
## 10  African American    1350        High
## 11         Caucasian    1407        High
## 12         Caucasian       0         Low
## 13             Asian     204 Medium-high
## 14         Caucasian    1081        High
## 15  African American     148  Medium_low
## 16  African American       0         Low
## 17  African American       0  Medium_low
## 18             Asian     368  Medium_low
## 19             Asian     891        High
## 20             Asian    1048        High
## 21             Asian      89         Low
## 22         Caucasian     968        High
## 23  African American       0         Low
## 24  African American     411 Medium-high
## 25         Caucasian       0         Low
## 26  African American     671  Medium_low
## 27         Caucasian     654  Medium_low
## 28  African American     467  Medium_low
## 29  African American    1809        High
## 30         Caucasian     915 Medium-high
## 31         Caucasian     863 Medium-high
## 32             Asian       0         Low
## 33         Caucasian     526        High
## 34         Caucasian       0         Low
## 35             Asian       0         Low
## 36         Caucasian     419         Low
## 37         Caucasian     762        High
## 38         Caucasian    1093        High
## 39         Caucasian     531  Medium_low
## 40         Caucasian     344  Medium_low
## 41  African American      50  Medium_low
## 42  African American    1155        High
## 43             Asian     385 Medium-high
## 44             Asian     976        High
## 45         Caucasian    1120        High
## 46         Caucasian     997        High
## 47             Asian    1241 Medium-high
## 48         Caucasian     797  Medium_low
## 49             Asian       0         Low
## 50  African American     902  Medium_low
## 51  African American     654 Medium-high
## 52  African American     211  Medium_low
## 53         Caucasian     607 Medium-high
## 54             Asian     957 Medium-high
## 55             Asian       0         Low
## 56             Asian       0         Low
## 57             Asian     379 Medium-high
## 58         Caucasian     133 Medium-high
## 59         Caucasian     333  Medium_low
## 60         Caucasian     531 Medium-high
## 61             Asian     631 Medium-high
## 62         Caucasian     108  Medium_low
## 63         Caucasian       0         Low
## 64         Caucasian     133         Low
## 65  African American       0         Low
## 66         Caucasian     602  Medium_low
## 67             Asian    1388        High
## 68  African American     889 Medium-high
## 69         Caucasian     822 Medium-high
## 70         Caucasian    1084        High
## 71         Caucasian     357  Medium_low
## 72             Asian    1103        High
## 73             Asian     663 Medium-high
## 74         Caucasian     601  Medium_low
## 75         Caucasian     945 Medium-high
## 76  African American      29  Medium_low
## 77         Caucasian     532  Medium_low
## 78             Asian     145  Medium_low
## 79         Caucasian     391        High
## 80             Asian       0         Low
## 81         Caucasian     162  Medium_low
## 82         Caucasian      99  Medium_low
## 83         Caucasian     503  Medium_low
## 84         Caucasian       0         Low
## 85         Caucasian       0         Low
## 86             Asian    1779        High
## 87         Caucasian     815        High
## 88             Asian       0         Low
## 89  African American     579  Medium_low
## 90  African American    1176        High
## 91  African American    1023 Medium-high
## 92         Caucasian     812        High
## 93         Caucasian       0         Low
## 94  African American     937 Medium-high
## 95  African American       0         Low
## 96  African American       0         Low
## 97             Asian    1380 Medium-high
## 98  African American     155  Medium_low
## 99  African American     375         Low
## 100        Caucasian    1311        High
## 101        Caucasian     298  Medium_low
## 102        Caucasian     431         Low
## 103        Caucasian    1587        High
## 104        Caucasian    1050        High
## 105        Caucasian     745 Medium-high
## 106        Caucasian     210  Medium_low
## 107            Asian       0         Low
## 108            Asian       0  Medium_low
## 109        Caucasian     227        High
## 110            Asian     297  Medium_low
## 111            Asian      47  Medium_low
## 112 African American       0         Low
## 113        Caucasian    1046        High
## 114            Asian     768        High
## 115        Caucasian     271  Medium_low
## 116 African American     510 Medium-high
## 117        Caucasian       0         Low
## 118            Asian    1341        High
## 119        Caucasian       0         Low
## 120        Caucasian       0         Low
## 121            Asian       0         Low
## 122        Caucasian     454        High
## 123        Caucasian     904        High
## 124 African American       0         Low
## 125        Caucasian       0         Low
## 126        Caucasian       0         Low
## 127            Asian    1404 Medium-high
## 128        Caucasian       0  Medium_low
## 129 African American    1259        High
## 130 African American     255  Medium_low
## 131 African American     868  Medium_low
## 132            Asian       0         Low
## 133 African American     912 Medium-high
## 134 African American    1018 Medium-high
## 135 African American     835 Medium-high
## 136 African American       8  Medium_low
## 137 African American      75  Medium_low
## 138            Asian     187  Medium_low
## 139        Caucasian       0         Low
## 140 African American    1597        High
## 141 African American    1425        High
## 142        Caucasian     605        High
## 143            Asian     669        High
## 144 African American     710 Medium-high
## 145        Caucasian      68         Low
## 146            Asian     642 Medium-high
## 147        Caucasian     805 Medium-high
## 148        Caucasian       0         Low
## 149        Caucasian       0         Low
## 150            Asian       0         Low
## 151 African American     581 Medium-high
## 152        Caucasian     534 Medium-high
## 153        Caucasian     156         Low
## 154        Caucasian       0  Medium_low
## 155 African American       0  Medium_low
## 156            Asian       0         Low
## 157        Caucasian     429  Medium_low
## 158        Caucasian    1020 Medium-high
## 159            Asian     653 Medium-high
## 160            Asian       0         Low
## 161        Caucasian     836 Medium-high
## 162        Caucasian       0         Low
## 163        Caucasian    1086        High
## 164 African American       0         Low
## 165            Asian     548        High
## 166        Caucasian     570  Medium_low
## 167 African American       0         Low
## 168        Caucasian       0         Low
## 169            Asian       0         Low
## 170        Caucasian    1099        High
## 171            Asian       0         Low
## 172        Caucasian     283  Medium_low
## 173            Asian     108 Medium-high
## 174 African American     724 Medium-high
## 175 African American    1573        High
## 176        Caucasian       0         Low
## 177        Caucasian       0         Low
## 178            Asian     384  Medium_low
## 179        Caucasian     453  Medium_low
## 180        Caucasian    1237        High
## 181            Asian     423  Medium_low
## 182 African American     516 Medium-high
## 183 African American     789        High
## 184            Asian       0  Medium_low
## 185        Caucasian    1448        High
## 186 African American     450  Medium_low
## 187 African American     188  Medium_low
## 188            Asian       0         Low
## 189 African American     930        High
## 190        Caucasian     126  Medium_low
## 191        Caucasian     538 Medium-high
## 192            Asian    1687        High
## 193            Asian     336  Medium_low
## 194        Caucasian    1426        High
## 195 African American       0         Low
## 196 African American     802 Medium-high
## 197 African American     749        High
## 198 African American      69  Medium_low
## 199 African American       0         Low
## 200            Asian     571  Medium_low
## 201 African American     829 Medium-high
## 202        Caucasian    1048        High
## 203        Caucasian       0         Low
## 204 African American    1411        High
## 205        Caucasian     456 Medium-high
## 206        Caucasian     638  Medium_low
## 207        Caucasian       0         Low
## 208        Caucasian    1216  Medium_low
## 209 African American     230  Medium_low
## 210 African American     732        High
## 211        Caucasian      95  Medium_low
## 212        Caucasian     799 Medium-high
## 213        Caucasian     308  Medium_low
## 214 African American     637 Medium-high
## 215        Caucasian     681 Medium-high
## 216        Caucasian     246  Medium_low
## 217            Asian      52         Low
## 218        Caucasian     955 Medium-high
## 219            Asian     195         Low
## 220        Caucasian     653 Medium-high
## 221            Asian    1246 Medium-high
## 222 African American    1230        High
## 223        Caucasian    1549        High
## 224            Asian     573  Medium_low
## 225        Caucasian     701        High
## 226            Asian    1075 Medium-high
## 227        Caucasian    1032        High
## 228        Caucasian     482 Medium-high
## 229 African American     156 Medium-high
## 230            Asian    1058        High
## 231 African American     661 Medium-high
## 232        Caucasian     657 Medium-high
## 233            Asian     689 Medium-high
## 234        Caucasian       0         Low
## 235 African American    1329        High
## 236 African American     191         Low
## 237            Asian     489 Medium-high
## 238            Asian     443 Medium-high
## 239        Caucasian      52         Low
## 240            Asian     163  Medium_low
## 241            Asian     148  Medium_low
## 242 African American       0         Low
## 243            Asian      16         Low
## 244        Caucasian     856        High
## 245        Caucasian       0         Low
## 246 African American       0         Low
## 247            Asian     199  Medium_low
## 248        Caucasian       0         Low
## 249        Caucasian       0         Low
## 250        Caucasian      98         Low
## 251            Asian       0         Low
## 252        Caucasian     132  Medium_low
## 253        Caucasian    1355        High
## 254 African American     218 Medium-high
## 255        Caucasian    1048        High
## 256 African American     118  Medium_low
## 257            Asian       0         Low
## 258            Asian       0         Low
## 259        Caucasian       0         Low
## 260        Caucasian    1092        High
## 261            Asian     345 Medium-high
## 262            Asian    1050        High
## 263        Caucasian     465  Medium_low
## 264 African American     133  Medium_low
## 265        Caucasian     651 Medium-high
## 266 African American     549 Medium-high
## 267        Caucasian      15 Medium-high
## 268            Asian     942 Medium-high
## 269        Caucasian       0         Low
## 270        Caucasian     772 Medium-high
## 271        Caucasian     136         Low
## 272        Caucasian     436 Medium-high
## 273            Asian     728  Medium_low
## 274            Asian    1255 Medium-high
## 275 African American     967 Medium-high
## 276        Caucasian     529        High
## 277        Caucasian     209  Medium_low
## 278        Caucasian     531 Medium-high
## 279        Caucasian     250        High
## 280        Caucasian     269         Low
## 281 African American     541 Medium-high
## 282 African American       0         Low
## 283        Caucasian    1298        High
## 284        Caucasian     890        High
## 285        Caucasian       0         Low
## 286            Asian       0         Low
## 287        Caucasian       0         Low
## 288        Caucasian       0  Medium_low
## 289            Asian     863 Medium-high
## 290        Caucasian     485  Medium_low
## 291            Asian     159  Medium_low
## 292        Caucasian     309  Medium_low
## 293        Caucasian     481 Medium-high
## 294 African American    1677        High
## 295            Asian       0         Low
## 296        Caucasian       0         Low
## 297        Caucasian     293 Medium-high
## 298            Asian     188  Medium_low
## 299 African American       0         Low
## 300 African American     711 Medium-high
## 301        Caucasian     580 Medium-high
## 302        Caucasian     172  Medium_low
## 303        Caucasian     295 Medium-high
## 304 African American     414 Medium-high
## 305            Asian     905        High
## 306            Asian       0         Low
## 307        Caucasian      70  Medium_low
## 308            Asian       0  Medium_low
## 309            Asian     681 Medium-high
## 310        Caucasian     885        High
## 311 African American    1036 Medium-high
## 312        Caucasian     844 Medium-high
## 313        Caucasian     823 Medium-high
## 314            Asian     843        High
## 315 African American    1140        High
## 316        Caucasian     463  Medium_low
## 317            Asian    1142        High
## 318        Caucasian     136 Medium-high
## 319        Caucasian       0         Low
## 320 African American       0         Low
## 321 African American       5         Low
## 322        Caucasian      81  Medium_low
## 323        Caucasian     265  Medium_low
## 324        Caucasian    1999        <NA>
## 325            Asian     415         Low
## 326        Caucasian     732 Medium-high
## 327        Caucasian    1361        High
## 328            Asian     984        High
## 329        Caucasian     121  Medium_low
## 330        Caucasian     846        High
## 331            Asian    1054  Medium_low
## 332        Caucasian     474  Medium_low
## 333        Caucasian     380 Medium-high
## 334 African American     182  Medium_low
## 335        Caucasian     594 Medium-high
## 336        Caucasian     194  Medium_low
## 337        Caucasian     926        High
## 338 African American       0         Low
## 339        Caucasian     606 Medium-high
## 340 African American    1107        High
## 341 African American     320  Medium_low
## 342        Caucasian     426  Medium_low
## 343 African American     204         Low
## 344        Caucasian     410  Medium_low
## 345            Asian     633 Medium-high
## 346        Caucasian       0         Low
## 347            Asian     907 Medium-high
## 348        Caucasian    1192        High
## 349        Caucasian       0         Low
## 350            Asian     503 Medium-high
## 351        Caucasian       0         Low
## 352        Caucasian     302 Medium-high
## 353 African American     583        High
## 354        Caucasian     425 Medium-high
## 355            Asian     413  Medium_low
## 356 African American    1405        High
## 357        Caucasian     962        High
## 358            Asian       0         Low
## 359        Caucasian     347  Medium_low
## 360 African American     611  Medium_low
## 361        Caucasian     712        High
## 362            Asian     382 Medium-high
## 363 African American     710 Medium-high
## 364        Caucasian     578        High
## 365            Asian    1243        High
## 366        Caucasian     790        High
## 367        Caucasian    1264        High
## 368 African American     216  Medium_low
## 369        Caucasian     345 Medium-high
## 370        Caucasian    1208        High
## 371        Caucasian     992        High
## 372        Caucasian       0         Low
## 373        Caucasian     840  Medium_low
## 374        Caucasian    1003 Medium-high
## 375        Caucasian     588 Medium-high
## 376        Caucasian    1000 Medium-high
## 377 African American     767        High
## 378        Caucasian       0         Low
## 379        Caucasian     717 Medium-high
## 380        Caucasian       0         Low
## 381 African American     661        High
## 382        Caucasian     849        High
## 383        Caucasian    1352 Medium-high
## 384        Caucasian     382  Medium_low
## 385 African American       0         Low
## 386            Asian     905 Medium-high
## 387 African American     371  Medium_low
## 388            Asian       0         Low
## 389        Caucasian    1129        High
## 390        Caucasian     806        High
## 391            Asian    1393        High
## 392        Caucasian     721        High
## 393            Asian       0         Low
## 394 African American       0         Low
## 395        Caucasian     734 Medium-high
## 396        Caucasian     560  Medium_low
## 397 African American     480  Medium_low
## 398        Caucasian     138  Medium_low
## 399        Caucasian       0         Low
## 400            Asian     966 Medium-high
Credit %>% filter(is.na(Credit$Type)) 
##    ID Income Limit Rating Cards Age Education Gender Student Married Ethnicity
## 1 324 182.73 13913    982     4  98        17   Male      No     Yes Caucasian
##   Balance Type
## 1    1999 <NA>
# Oops, there is atleast one datapoint that has NA for `Type`. Need to figure that out. 

And now to visualize

 gf_smooth(Balance~Income, data = Credit, method = "lm") %>% 
gf_point(Balance ~Income,data = Credit %>% filter(!is.na(Type)), color = ~Type) %>% 
  gf_smooth(Balance ~Income,data = Credit %>% filter(!is.na(Type)), color = ~Type,method = "lm")

As can be seen, if we split up the individuals into groups based on Income, then not all groups have a positive correlation with Balance, but the total group does!

Whereas in aggregate there is an overall positive relationship, when broken down we now see that for the low (red points), medium-low (green points), and medium-high (blue points) income bracket groups, the strong positive relationship between credit card balance and income disappears! Only for the high bracket does the relationship stay somewhat positive. In this example, credit limit is a confounding variable for credit card Balance and Income.

###Experimenting for Simpson’s Paradox

n <- 50
df <- tibble(x = seq(1,n,1), y = 0.2 * x - rnorm(n, mean = 0, sd = 16), z = sample(c("a","b", "c","d"),size = n, replace = TRUE))

gf_smooth(y ~ x, data = df, method = "lm",color = "black") %>% 
gf_point(y ~ x, color = ~ z, data = df) %>% 
  gf_smooth(y ~ x, color = ~z, data = df,method = "lm")

Not quite a good demonstration, need to search the web.

7 Sampling

tactile_prop_red: Dataset from moderndive , of Hand-sampling of a set of balls to record the proportion of red balls.

tactile_prop_red
## # A tibble: 33 x 4
##    group            replicate red_balls prop_red
##    <chr>                <int>     <int>    <dbl>
##  1 Ilyas, Yohan             1        21     0.42
##  2 Morgan, Terrance         2        17     0.34
##  3 Martin, Thomas           3        21     0.42
##  4 Clark, Frank             4        21     0.42
##  5 Riddhi, Karina           5        18     0.36
##  6 Andrew, Tyler            6        19     0.38
##  7 Julia                    7        19     0.38
##  8 Rachel, Lauren           8        11     0.22
##  9 Daniel, Caroline         9        15     0.3 
## 10 Josh, Maeve             10        17     0.34
## # … with 23 more rows
gf_histogram(~prop_red,data = tactile_prop_red)

This is a sampling activity (like Shubha) and we can see sampling variation in the readings obtained by each pair of students.

7.1 Virtual Sampling

bowl #A mixture of red and white balls.
## # A tibble: 2,400 x 2
##    ball_ID color
##      <int> <chr>
##  1       1 white
##  2       2 white
##  3       3 white
##  4       4 red  
##  5       5 white
##  6       6 white
##  7       7 red  
##  8       8 white
##  9       9 red  
## 10      10 white
## # … with 2,390 more rows
virtual_shovel <- bowl %>% 
  moderndive::rep_sample_n(size = 50)
virtual_shovel
## # A tibble: 50 x 3
## # Groups:   replicate [1]
##    replicate ball_ID color
##        <int>   <int> <chr>
##  1         1     996 white
##  2         1    1196 white
##  3         1     501 red  
##  4         1     471 white
##  5         1     387 white
##  6         1    1339 white
##  7         1    1634 red  
##  8         1    1211 white
##  9         1    2280 white
## 10         1    1465 red  
## # … with 40 more rows
virtual_shovel %>% 
  mutate(is_red = (color == "red")) %>%
  summarize(prop_red = mean(is_red))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 1 x 2
##   replicate prop_red
##       <int>    <dbl>
## 1         1     0.48
# Can also do
virtual_shovel %>% 
  summarise(num_red = sum(color == "red")) %>%
  mutate(prop_red = num_red/50)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 1 x 3
##   replicate num_red prop_red
##       <int>   <int>    <dbl>
## 1         1      24     0.48

7.2 Repeated Virtual Sampling

We can repeat the sampling more than once.

virtual_samples <- bowl %>% 
  rep_sample_n(size = 50, reps = 33,replace = TRUE)
virtual_samples
## # A tibble: 1,650 x 3
## # Groups:   replicate [33]
##    replicate ball_ID color
##        <int>   <int> <chr>
##  1         1    1759 white
##  2         1    2125 white
##  3         1      11 white
##  4         1    1409 white
##  5         1     156 white
##  6         1     924 red  
##  7         1     764 white
##  8         1     549 white
##  9         1    1374 red  
## 10         1     745 white
## # … with 1,640 more rows
# Weird that it shows `20` as the max `replicate`. However `tail(virtual_samples) shows things correctly. 

Observe that while the first 50 rows of replicate are equal to 1, the next 50 rows of replicate are equal to 2. This is telling us that the first 50 rows correspond to the first sample of 50 balls while the next 50 correspond to the second sample of 50 balls. This pattern continues for all reps = 33 replicates and thus virtual_samples has 33 × 50 = 1650 rows.

virtual_prop_red <- 
  virtual_samples %>% 
  group_by(replicate) %>%
  summarise(prop_red = sum(color == "red")/50)
## `summarise()` ungrouping output (override with `.groups` argument)
virtual_prop_red # OK! HERE there are 33 rows! But weird!
## # A tibble: 33 x 2
##    replicate prop_red
##        <int>    <dbl>
##  1         1     0.34
##  2         2     0.34
##  3         3     0.46
##  4         4     0.4 
##  5         5     0.36
##  6         6     0.4 
##  7         7     0.34
##  8         8     0.42
##  9         9     0.32
## 10        10     0.44
## # … with 23 more rows
# Visualisation
virtual_prop_red %>% 
  gf_histogram(~prop_red)

We can do this a 1000 times:

7.2.1 One Thousand Replicates

virtual_samples <- 
  rep_sample_n(bowl,size = 50, reps = 1000)

virtual_samples %>% 
  summarise(num_red = sum(color == "red"), n = n(), prop_red = num_red/n) %>% 
  gf_histogram(~ prop_red,color = "white",binwidth = 0.01)
## `summarise()` ungrouping output (override with `.groups` argument)

7.2.2 Varying shovel size !

Let’s use rep_sample_n() with size = 25, size = 50, and size = 100, while keeping the number of repeated/replicated samples at 1000: - Virtually use the appropriate shovel to generate 1000 samples with size balls. - Compute the resulting 1000 replicated of the proportion of the shovel’s balls that are red. - Visualize the distribution of these 1000 proportion red using a histogram. - Then compare the three resulting histograms.

# Shovel 25
n = 25
virtual_samples_25 <- bowl %>% 
  rep_sample_n(size = n,reps = 1000,replace = TRUE) %>% 
  summarise(num_red = sum(color == "red"), count = n(), prop_red = num_red/count)
## `summarise()` ungrouping output (override with `.groups` argument)
sd_25 <- sd(virtual_samples_25$prop_red)
  gf_histogram(~prop_red, data = virtual_samples_25,color = "white",binwidth = 0.02, boundary = 0.4)

# Shovel 50
n = 50
virtual_samples_50 <- bowl %>% 
  rep_sample_n(size = n,reps = 1000,replace = TRUE) %>% 
  summarise(num_red = sum(color == "red"), count = n(), prop_red = num_red/count)
## `summarise()` ungrouping output (override with `.groups` argument)
sd_50 <- sd(virtual_samples_50$prop_red)
  gf_histogram(~prop_red, data = virtual_samples_50,color = "white",binwidth = 0.02, boundary = 0.4)

# Shovel 100
n = 100
virtual_samples_100 <- bowl %>% 
  rep_sample_n(size = n,reps = 1000,replace = TRUE) %>% 
  summarise(num_red = sum(color == "red"), count = n(), prop_red = num_red/count)
## `summarise()` ungrouping output (override with `.groups` argument)
sd_100 <- sd(virtual_samples_100$prop_red)
  gf_histogram(~prop_red, data = virtual_samples_100,color = "white",binwidth = 0.02, boundary = 0.4)

sd_25;sd_50;sd_100
## [1] 0.098561
## [1] 0.066643
## [1] 0.048238

Can we plot the three histograms one on top of the other?

gf_histogram(
  ~ prop_red,
  data = virtual_samples_25,
  fill = "grey50",
  color = "white",
  binwidth = 0.02
) %>%
  gf_histogram(
    ~ prop_red,
    data = virtual_samples_50,
    fill = "salmon",
    color = "white",
    binwidth = 0.02
  ) %>%
  gf_histogram(
    ~ prop_red,
    data = virtual_samples_100,
    fill = "dodgerblue",
    color = "white",
    binwidth = 0.02
  )

So as the number of slots in the shovel increased, this standard deviation decreased. These types of standard deviations have another special name: standard errors; they quantify the effect of sampling variation induced on our estimates.

However despite this sampling variation, our sample proportions \(\hat p\) were always centered around the true population proportion. This is also known as having an accurate estimate.

In our simulations, as the sample size increases, the spread/variation of our sample proportions \(\hat p\) around the true population proportion p decreases. You can observe this behavior as well in Figure 8.13. This is also known as having a more precise estimate.

Look at Accuracy and Precision

Once again, let’s revisit the sampling paradigm:
- If the sampling of a sample of size n is done at random, then - the sample is unbiased and representative of the population of size N, thus - any result based on the sample can generalize to the population, thus - the point estimate is a “good guess” of the unknown population parameter, thus - instead of performing a census, we can infer about the population using sampling.

What we did was to demonstrate a very famous theorem, or mathematically proven truth, called the Central Limit Theorem. It loosely states that when sample means and sample proportions are based on larger and larger sample sizes, the sampling distribution of these two point estimates become more and more normally shaped and more and more narrow.

8 Confidence Intervals

TABLE 7.1: Scenarios of sampling for inference

Scenario Population parameter Notation Point estimate Notation.
1 Population proportion p Sample proportion \(\hat p\)
2 Population mean μ Sample mean μ or \(\hat x\)
3 Difference in population proportions p1−p2 Difference in sample proportions \(\hat p_1− \hat p_2\)
4 Difference in population means \(μ_1 − μ_2\) Difference in sample means \(x_1 -x_2\)
5 Population regression slope \(β_1\) Sample regression slope \(\hatβ_1\) or \(b_1\)
6 Population regression intercept \(β_0\) Sample regression intercept \(\hat β_0\) or \(b_0\)
7 Population Standard Deviation $$ Sample Standard Deviation $$

In most cases, we don’t have the population values as we did with the bowl of balls. We only have a single sample of data from a larger population. We’d like to be able to make some reasonable guesses about population parameters using that single sample to create a range of plausible values for a population parameter. This range of plausible values is known as a confidence interval and will be the focus of this chapter.

And how do we use a single sample to get some idea of how other samples might vary in terms of their statistic values? One common way this is done is via a process known as bootstrapping that will be the focus of the beginning sections of this chapter.

8.1 Population Stats: What we want to know

# Original Population
# Visualise
pennies %>% 
  gf_histogram(~age_in_2011, color = "white", fill = "black", title = "Population")

# Calculate
pennies %>%
  summarise(
    mean = mean(age_in_2011),
    median = median(age_in_2011),
    sd = sd(age_in_2011)
  )
## # A tibble: 1 x 3
##    mean median    sd
##   <dbl>  <dbl> <dbl>
## 1  21.2     20  12.4

Pennies is slightly right-skewed with age_in_2011. Mean is slightly to the right of the median. Quite some pennies that are >40 years old.

8.2 Bootstrapping

When we have only one sample, pennies_sample. We wish to find the mean age of all pennies minted in the US, based on this one sample.

pennies_sample <- sample_n(pennies, size = 50, replace = FALSE)
pennies_sample
## # A tibble: 50 x 2
##     year age_in_2011
##    <int>       <int>
##  1  1997          14
##  2  1976          35
##  3  2007           4
##  4  1997          14
##  5  2002           9
##  6  2000          11
##  7  1977          34
##  8  1993          18
##  9  2003           8
## 10  1982          29
## # … with 40 more rows
#EDA
pennies_sample %>% 
gf_histogram(~ age_in_2011, bins = 10,color = "white", data = pennies_sample)

# Roughly symmetric.

# Calculate `sample mean`
x_bar <- pennies_sample %>% 
  summarise(mean(age_in_2011))
x_bar
## # A tibble: 1 x 1
##   `mean(age_in_2011)`
##                 <dbl>
## 1                20.2

Bootstrapping is a method of generating fresh samples from an existing single sample, by sampling with replacement . Sample size with bootstrapping is the same as the that of the original sample.

# We generate one bootstrap sample from `pennies_sample`
bootstrap_sample1 <- 
  pennies_sample %>%
  rep_sample_n(size = 40, replace = TRUE, reps = 1)
bootstrap_sample1
## # A tibble: 40 x 3
## # Groups:   replicate [1]
##    replicate  year age_in_2011
##        <int> <int>       <int>
##  1         1  1984          27
##  2         1  2006           5
##  3         1  1998          13
##  4         1  1969          42
##  5         1  1975          36
##  6         1  1979          32
##  7         1  1990          21
##  8         1  1997          14
##  9         1  1997          14
## 10         1  1975          36
## # … with 30 more rows
#Bootstrap Stat
# We can calculate **bootstrap_statistics** for this sample.
bootstrap_sample1 %>% 
  summarise(stat = mean(age_in_2011))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 1 x 2
##   replicate  stat
##       <int> <dbl>
## 1         1  22.5
#Visualize this bootstrap sample
gf_histogram(~age_in_2011, color = "white", bins = 20, data = bootstrap_sample1) %>% 
  gf_vline(xintercept = ~ mean(bootstrap_sample1$age_in_2011), color = "red") %>% 
  gf_vline(data = pennies, xintercept = ~ mean(pennies$age_in_2011), color = "green")
## Warning: Use of `pennies$age_in_2011` is discouraged. Use `age_in_2011` instead.

# This doesn't work...weird behaviour of `ggformula`
# gf_vline(xintercept = mean(age_in_2011), data = bootstrap_sample1)
# Now it does: use the formula `~` symbol in front of all parameters. 

This mean (red line) is different from the sample mean we calculated earlier (green line).

This is another sample, one calculated using bootstrap, that we could assume comes from the population of interest.

# More bootstrap samples
six_bootstrap_samples <- 
  pennies_sample %>%
  rep_sample_n(size = 40,replace = TRUE, reps = 6)

#Visualise
gf_histogram(~ age_in_2011 | replicate, data = six_bootstrap_samples, color = "white",fill = "black") %>% 
  gf_vline(data = six_bootstrap_samples, 
           xintercept = ~mean(six_bootstrap_samples$age_in_2011), 
           color = "red") %>% # Vline for the sample mean
  gf_vline(data = pennies, 
           xintercept = ~ mean(pennies$age_in_2011), # Note the `tilde`
           color = "green") %>% # Vline for the overall bootstrap mean
  gf_vline(data = (six_bootstrap_samples %>% 
      group_by(replicate) %>% 
      summarise(mean = mean(age_in_2011))),xintercept = ~ mean, color = "purple", linetype = "dashed") 
## `summarise()` ungrouping output (override with `.groups` argument)

# What a command I have cooked here! Shows the per-replicate sample-mean in a plot facetted by replicate!

gf_histogram(~ age_in_2011, data = six_bootstrap_samples) %>% 
  gf_vline(data = six_bootstrap_samples, 
           xintercept = ~ mean(six_bootstrap_samples$age_in_2011), # tilde
           color = "red") %>% 
  gf_vline(data = pennies, 
           xintercept = ~ mean(pennies$age_in_2011), # Note the `tilde`
           color = "green")
## Warning: Use of `pennies$age_in_2011` is discouraged. Use `age_in_2011` instead.

# Calculate six means
six_bootstrap_samples %>% 
  group_by(replicate) %>% 
  summarise(stat = mean(age_in_2011))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 6 x 2
##   replicate  stat
##       <int> <dbl>
## 1         1  22.1
## 2         2  21.4
## 3         3  19.9
## 4         4  20  
## 5         5  20.9
## 6         6  19.8

8.3 Using the Infer package

We see that the calculated statistic stat varies from replicate to replicate and has a distribution of its own. So, instead of doing this just six times, we can replicate this a 1000 times or more, over and over and look at the distribution of ’stat`.

We can do all this using dplyr-like verbs from the package infer.

  • uses the pipe

So the SAMPLING flow withinfer is:

specify(response ~ explanatory) %>% generate(reps) %>% calculate(stat) %>% visualise().

8.3.1 specify

Chooses the variable which will be the focus of the statistical inference. We can specify which variable is explanatory and which is a response.

pennies_sample %>% 
  infer::specify(response = age_in_2011)
## Response: age_in_2011 (integer)
## # A tibble: 50 x 1
##    age_in_2011
##          <int>
##  1          14
##  2          35
##  3           4
##  4          14
##  5           9
##  6          11
##  7          34
##  8          18
##  9           8
## 10          29
## # … with 40 more rows
# Or using `formula` notation as with `ggformula`, `y ~ x`
pennies_sample %>% 
  specify(age_in_2011 ~ NULL)
## Response: age_in_2011 (integer)
## # A tibble: 50 x 1
##    age_in_2011
##          <int>
##  1          14
##  2          35
##  3           4
##  4          14
##  5           9
##  6          11
##  7          34
##  8          18
##  9           8
## 10          29
## # … with 40 more rows

8.3.2 Generate replicates using generate()

thousand_bootstrap_samples <- 
  pennies_sample %>%
  specify(age_in_2011 ~ NULL) %>%
  generate(size = 40,reps = 1000, type = "bootstrap")
thousand_bootstrap_samples
## Response: age_in_2011 (integer)
## # A tibble: 50,000 x 2
## # Groups:   replicate [1,000]
##    replicate age_in_2011
##        <int>       <int>
##  1         1          29
##  2         1          14
##  3         1           3
##  4         1           3
##  5         1          10
##  6         1          37
##  7         1          34
##  8         1          36
##  9         1          42
## 10         1          10
## # … with 49,990 more rows

8.3.3 Calculating Sample Stats using calculate()

This collapses the samples to one mean per replicate.

bootstrap_distribution <- pennies_sample %>% 
  mutate(age_in_2011 = 2011 - year) %>% 
  specify(age_in_2011 ~ NULL) %>% 
  generate(size = 40, reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "mean")

bootstrap_distribution # A thousand rows with means
## # A tibble: 1,000 x 2
##    replicate  stat
##        <int> <dbl>
##  1         1  23.2
##  2         2  20.5
##  3         3  19.3
##  4         4  16.9
##  5         5  16.8
##  6         6  18.3
##  7         7  23.4
##  8         8  19.9
##  9         9  20  
## 10        10  19.8
## # … with 990 more rows

8.3.4 Visualize

# My way
pennies_sample %>% 
  mutate(age_in_2011 = 2011 - year) %>% 
  specify(age_in_2011 ~ NULL) %>% 
  generate(size = 40, reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "mean") %>% 
  gf_histogram(~stat,color = "white", fill = "black",
               title = "Bootstrap Distribution",
               subtitle = "Plotted with ggformula")

#Even better with `infer`! Cool !!!
bootstrap_distribution %>% 
  visualise()

# In one command
pennies_sample %>% 
  specify(age_in_2011 ~ NULL) %>% 
  generate(size = 40, reps = 1000,type = "bootstrap") %>% 
  calculate(stat = "mean") %>% 
  visualise()

# Furthermore
pennies_sample %>% 
  specify(age_in_2011 ~ NULL) %>% 
  generate(size = 40, reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "mean") %>% 
  summarise(mean_of_means = mean(stat))
## # A tibble: 1 x 1
##   mean_of_means
##           <dbl>
## 1          20.2
# Compare Pop_mean with Bootstrap_mean
pennies %>% 
  summarise(mean = mean(age_in_2011), median = median(age_in_2011),sd = sd(age_in_2011))
## # A tibble: 1 x 3
##    mean median    sd
##   <dbl>  <dbl> <dbl>
## 1  21.2     20  12.4

8.3.5 Confidence Intervals

  • give a set of plausible values for a parameter that has been estimated, depending upon confidence levels that are desired/specified.

the bootstrap distribution provides us a guess as to what the variability in different sample means may look like, only using the original sample as our guide. We can quantify this variability in the form of a 95% confidence interval in a couple different ways.

  • Percentile method
  • Standard Error method when the (bootstrap) sampling distribution is normal
# What we did earlier with `calculate()`:

bootstrap_distribution <-
  pennies_sample %>%
  specify(age_in_2011 ~ NULL) %>%
  generate(size = 40, reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")
 
# Confidence Intervals - Percentile
percentile_ci <- 
  bootstrap_distribution %>% 
  get_ci(level = 0.95,type = "percentile")
percentile_ci
## # A tibble: 1 x 2
##   `2.5%` `97.5%`
##    <dbl>   <dbl>
## 1   17.2    23.4
#Visualise bootstrap percentile CIs
bootstrap_distribution %>% 
  visualise() + shade_confidence_interval(endpoints = percentile_ci)

# Does NOT work with the pipe !!

# A combination piped command does not work...
# pennies_sample %>%
#   specify(age_in_2011 ~ NULL) %>%
#   generate(size = 40, reps = 1000, type = "bootstrap") %>%
#   calculate(stat = "mean") %>% # bootstrap distribution
#   get_ci(level = 0.95, type = "percentile") %>%
#   visualise(endpoints = ., endpoints_color = "purple",direction = "between")
# Does not work. Why? `Visualise` needs a distribution to plot; `get_ci` gives two numbers. That's why. 

percentile works with any sort of bootstrap distribution.

If the bootstrap distribution is *close to symmetric and bell-shaped*, we can also use a shortcut formula for determining the lower and upper endpoints of the confidence interval. This is done by using the formula $\bar x ±(multiplier∗SE)$, where $\bar x$ is our original sample mean and `SE` stands for standard error and corresponds to the standard deviation of the bootstrap distribution. The value of multiplier here is the **appropriate percentile** of the **standard normal distribution**.
# So we can do the same using SE
# We use this when the bootstrap distribution is near-normal
# Use normal percentiles to specify CI limits
# setting type = "se" does this

# recall that x_bar was the sample mean, mean(pennies_sample)
# 
se_ci <- bootstrap_distribution %>% 
  get_ci(type = "se", point_estimate = x_bar)
se_ci
## # A tibble: 1 x 2
##   lower upper
##   <dbl> <dbl>
## 1  16.9  23.5
# Recall what we did with Calculate: `sample mean` x_bar and the bootstrap distribution
# x_bar <- pennies_sample %>% summarise(mean(age_in_2011))
# AND
# bootstrap_distribution <- 
#   pennies_sample %>% 
#   specify(age_in_2011 ~ NULL) %>% 
#   generate(size = 40, reps = 1000, type = "bootstrap") %>% 
#   calculate(stat = "mean")

bootstrap_distribution %>% 
  visualise() + shade_confidence_interval(endpoints = se_ci)

#Combined command does not work...why? Now we understand !
# pennies_sample %>% 
#   specify(age_in_2011 ~ NULL) %>%
#   generate(reps = 1000, type = "bootstrap") %>%
#   calculate(stat = "mean") %>%
#   get_ci(type = "se",point_estimate = x_bar) %>%
#   visualise(direction = "between")

8.3.6 Comparing Sampling and Bootstrap means, medians

In sampling we assume we have access to the population. With bootstrap, we have just one sample to work with.

# "True" Sampling ( i.e. without replacement)
# thousand_samples <- pennies %>% # Note the access to the population
#   rep_sample_n(size = 40, reps = 1000, replace = FALSE)

sampling_distribution <- pennies %>% 
  # Note the access to the population
  rep_sample_n(size = 40, reps = 1000, replace = FALSE) %>% 
  group_by(replicate) %>% 
  summarise(stat = mean(age_in_2011))
## `summarise()` ungrouping output (override with `.groups` argument)
sampling_distribution
## # A tibble: 1,000 x 2
##    replicate  stat
##        <int> <dbl>
##  1         1  19.6
##  2         2  21.8
##  3         3  20.2
##  4         4  23.0
##  5         5  21.8
##  6         6  22.0
##  7         7  19.5
##  8         8  21.0
##  9         9  19.2
## 10        10  21.2
## # … with 990 more rows
sampling_distribution %>% 
  gf_histogram(~ stat, color = "white", fill = "salmon", bins = 10,title = "Sampling Distribution with ggformula")

# And the final estimate
final_sampling_calcs <- 
sampling_distribution %>% 
  summarise(mean = mean(stat), se = sd(stat))
final_sampling_calcs
## # A tibble: 1 x 2
##    mean    se
##   <dbl> <dbl>
## 1  21.3  1.89
Note that this `sampling stat` i.e. `mean` is pretty closely matching that of the original population. This is an artifact of the **Central Limit Theorem**, of course. 
# Sampling with Bootstrap
# What we did earlier:
# 
# bootstrap_distribution <- 
#   pennies_sample %>% 
#   specify(age_in_2011 ~ NULL) %>% 
#   generate(size = 40, reps = 1000, type = "bootstrap") %>% 
#   calculate(stat = "mean")
#   
bootstrap_distribution %>% 
  visualise(bins = 10, fill = "blue")

# Calculate "se'
final_bootstrap_calcs <- bootstrap_distribution %>% 
  summarise(mean = mean(stat),se = sd(stat))

results <- rbind(final_sampling_calcs,
final_bootstrap_calcs)
results
## # A tibble: 2 x 2
##    mean    se
##   <dbl> <dbl>
## 1  21.3  1.89
## 2  20.1  1.69

While the se are similar, the means are different when comparing sampling and bootstrap. Since the bootstrap distribution is centered at the original sample mean, it doesn’t necessarily provide a good estimate of the overall population mean μ.

8.3.7 Interpreting the Confidence Interval

We saw the two sets of CIs, bootstrap and Sampling:

## Bootstrap
## Let's try a new "single" sample:
pennies_sample2 <- 
  pennies %>% 
  sample_n(size = 40)

pennies_sample2 %>% 
  specify(age_in_2011~ NULL) %>% 
  generate(reps = 1000,type = "bootstrap") %>% 
  calculate(stat = "mean") %>% 
  get_ci()
## # A tibble: 1 x 2
##   `2.5%` `97.5%`
##    <dbl>   <dbl>
## 1   17.2    25.1

This is one set of bootstrap_ci, and we see that the **population mean* falls within these limits. We want to know if this always happens. We can replicate this a 100 times and check if the pop_mean lies within the 100 bootstrap ci limits.

Using the original “shovel” problem for red and white balls , we try to establish confidence intervals for the proportion of the red balls.

tactile_shovel1 <- 
  tibble(color =  c(rep("red",21), rep("white",29)))
tactile_shovel1
## # A tibble: 50 x 1
##    color
##    <chr>
##  1 red  
##  2 red  
##  3 red  
##  4 red  
##  5 red  
##  6 red  
##  7 red  
##  8 red  
##  9 red  
## 10 red  
## # … with 40 more rows
# Observed Statistic (see mosaic app note)
p_hat <- 
  tactile_shovel1 %>% 
  specify(color ~ NULL, success = "red") %>% 
  calculate(stat = "prop")
p_hat
## # A tibble: 1 x 1
##    stat
##   <dbl>
## 1  0.42
# Bootstrap Samples
bootstrap_props <- 
  tactile_shovel1 %>% 
  specify(color ~ NULL, success = "red") %>% 
  generate(reps = 10000, type = "bootstrap") %>% 
  calculate(stat = "prop")

ci_bootstrap <- bootstrap_props %>% 
  get_ci()
ci_bootstrap
## # A tibble: 1 x 2
##   `2.5%` `97.5%`
##    <dbl>   <dbl>
## 1   0.28    0.56
bootstrap_props %>% 
  visualise() + shade_confidence_interval(endpoints = ci_bootstrap)

Since the resultant distribution is bell shaped ( on account of 10000 samples) we can choose either method for confidence interval calculation.

std_error_ci <- 
  bootstrap_props %>% 
  get_ci(level = 0.95, type = "se", point_estimate = p_hat)
std_error_ci
## # A tibble: 1 x 2
##   lower upper
##   <dbl> <dbl>
## 1 0.283 0.557
bootstrap_props %>% 
  visualise() + shade_confidence_interval(endpoints = std_error_ci,endpoints_color = "green")
## Warning: Ignoring unknown parameters: endpoints_colour

## Warning: Ignoring unknown parameters: endpoints_colour

### Theory-based Confidence Intervals

To construct a theory-based confidence interval for p, the unknown true population proportion we - Collect a sample of size n
- Compute \(\hatp\)
- Compute the standard error, \(se\) - Compute Margin of Error MoE as 1.96*se, for 95% Confidence Intervals - Use \(\hat p +/- MoE\) as estimates for our confidence intervals, provided we know that the distribution is bell shaped.

8.3.7.1 One Proportion

tactile_prop_red
## # A tibble: 33 x 4
##    group            replicate red_balls prop_red
##    <chr>                <int>     <int>    <dbl>
##  1 Ilyas, Yohan             1        21     0.42
##  2 Morgan, Terrance         2        17     0.34
##  3 Martin, Thomas           3        21     0.42
##  4 Clark, Frank             4        21     0.42
##  5 Riddhi, Karina           5        18     0.36
##  6 Andrew, Tyler            6        19     0.38
##  7 Julia                    7        19     0.38
##  8 Rachel, Lauren           8        11     0.22
##  9 Daniel, Caroline         9        15     0.3 
## 10 Josh, Maeve             10        17     0.34
## # … with 23 more rows
# 33 replicates of 50 samples out of 2400 white+red balls.
true_p <- 900/2400
conf_ints <- 
  tactile_prop_red %>%
  rename(p_hat = prop_red) %>%
  mutate(
    se = sqrt(p_hat * (1 - p_hat) / 50),
    MoE = 1.96 * se,
    ci_lower = p_hat - MoE,
    ci_upper = p_hat + MoE,
    captured = if_else(true_p >= ci_lower &
                         true_p <= ci_upper, TRUE, FALSE)
  )

conf_ints
## # A tibble: 33 x 9
##    group       replicate red_balls p_hat     se   MoE ci_lower ci_upper captured
##    <chr>           <int>     <int> <dbl>  <dbl> <dbl>    <dbl>    <dbl> <lgl>   
##  1 Ilyas, Yoh…         1        21  0.42 0.0698 0.137    0.283    0.557 TRUE    
##  2 Morgan, Te…         2        17  0.34 0.0670 0.131    0.209    0.471 TRUE    
##  3 Martin, Th…         3        21  0.42 0.0698 0.137    0.283    0.557 TRUE    
##  4 Clark, Fra…         4        21  0.42 0.0698 0.137    0.283    0.557 TRUE    
##  5 Riddhi, Ka…         5        18  0.36 0.0679 0.133    0.227    0.493 TRUE    
##  6 Andrew, Ty…         6        19  0.38 0.0686 0.135    0.245    0.515 TRUE    
##  7 Julia               7        19  0.38 0.0686 0.135    0.245    0.515 TRUE    
##  8 Rachel, La…         8        11  0.22 0.0586 0.115    0.105    0.335 FALSE   
##  9 Daniel, Ca…         9        15  0.3  0.0648 0.127    0.173    0.427 TRUE    
## 10 Josh, Maeve        10        17  0.34 0.0670 0.131    0.209    0.471 TRUE    
## # … with 23 more rows
gf_segment(group + group ~ ci_lower + ci_upper,data = conf_ints, color = ~captured) %>% 
  gf_point(group ~ p_hat,color = ~ captured,data = conf_ints) %>% 
  gf_vline(xintercept = ~ true_p,color = "red")

We can also create the tactile_prop_red using simulation.

virtual_samples <- 
  bowl %>% 
  rep_sample_n(size = 50,reps = 100) %>% 
  group_by(replicate) %>% 
  mutate(p_hat = mean(color == "red"),
         se = sqrt(p_hat * (1 - p_hat) / 50),
    MoE = 1.96 * se,
    ci_lower = p_hat - MoE,
    ci_upper = p_hat + MoE,
    captured = if_else(true_p >= ci_lower &
                         true_p <= ci_upper, TRUE, FALSE))

virtual_samples %>% 
  group_by(replicate) %>% 
  filter(captured == TRUE) %>% 
  nrow()/5000
## [1] 0.97
virtual_samples %>% 
gf_point(replicate ~ p_hat,color = ~ captured,data = virtual_samples,xlab = "Proportion Red",ylab = "Replicate ID") %>% 
gf_segment(replicate + replicate ~ ci_lower + ci_upper,data = virtual_samples, color = ~captured,size = 0.3) %>% 
  gf_vline(xintercept = ~ true_p,color = "red")

8.3.7.2 Comparing Two Proportions

Is yawning contagious?

library(janitor)
mythbusters_yawn
## # A tibble: 50 x 3
##     subj group   yawn 
##    <int> <chr>   <chr>
##  1     1 seed    yes  
##  2     2 control yes  
##  3     3 seed    no   
##  4     4 seed    yes  
##  5     5 seed    no   
##  6     6 control no   
##  7     7 seed    yes  
##  8     8 control no   
##  9     9 control no   
## 10    10 seed    no   
## # … with 40 more rows
# Using `janitor`
mythbusters_yawn %>% 
  tabyl(group,yawn) %>% 
  adorn_percentages() %>% 
  adorn_pct_formatting() %>% 
  adorn_ns()
##    group         no        yes
##  control 75.0% (12) 25.0%  (4)
##     seed 70.6% (24) 29.4% (10)

In looking over this problem, we can make note of some important details to include in our infer pipeline: - We are calling a success having a yawn value of “yes”. - Our response variable will always correspond to the variable used in the success so the response variable is yawn. - The explanatory variable is the other variable of interest here: group.

To summarize, we are looking to see the examine the relationship between yawning and whether or not the participant saw a seed yawn or not.

obs_diff <- 
  mythbusters_yawn %>% 
  specify(yawn ~ group,success = "yes") %>% 
  calculate(stat = "diff in props",order = c("seed", "control"))
obs_diff
## # A tibble: 1 x 1
##     stat
##    <dbl>
## 1 0.0441
# Bootstrap Distribution
# We should sample entire rows here since both the yawn and the group are vairables of interest in the model.

bootstrap_distribution <- mythbusters_yawn %>% 
  specify(yawn ~ group, success = "yes") %>% 
  generate(reps = 1000) %>% 
  calculate(stat = "diff in props", order = c("seed", "control"))
## Setting `type = "bootstrap"` in `generate()`.
bootstrap_distribution
## # A tibble: 1,000 x 2
##    replicate     stat
##        <int>    <dbl>
##  1         1  0.0392 
##  2         2  0.0660 
##  3         3 -0.143  
##  4         4  0.127  
##  5         5  0.00794
##  6         6  0.123  
##  7         7 -0.0476 
##  8         8 -0.0441 
##  9         9 -0.146  
## 10        10  0.0595 
## # … with 990 more rows
ci_myth <- bootstrap_distribution %>% 
  get_ci(level = 0.95, type = "percentile")
ci_myth
## # A tibble: 1 x 2
##   `2.5%` `97.5%`
##    <dbl>   <dbl>
## 1 -0.249   0.294
bootstrap_distribution %>% 
  visualise() + ## ALWAYS USE PLUS SIGN HERE !
  shade_confidence_interval(endpoints = ci_myth, fill = "green", color = "green")

Since the Confidence Intervals include/straddle 0, it is not possible to assert that yawning is contagious.

9 Hypothesis Testing

library(ggplot2movies)
library(broom)

Hypothesis testing may not always be needed; one must perform an EDA on the data first before considering a test.

9.1 Flights NY to SFO and BOS

Suppose we were interested in seeing if the air_time to SFO in San Francisco was statistically greater than the air_time to BOS in Boston.

bos_sfo <- 
  flights %>% 
  na.omit() %>% 
  filter(dest == "SFO" | dest == "BOS") %>% 
  group_by(dest) %>% 
  sample_n(100)
# Note: sampling after grouping gives equal numbers of each group!

bos_sfo_summary <- bos_sfo %>% 
  group_by(dest) %>% 
  summarise(mean_time = mean(air_time),
            sd_time = sd(air_time), count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
bos_sfo_summary
## # A tibble: 2 x 4
##   dest  mean_time sd_time count
##   <chr>     <dbl>   <dbl> <int>
## 1 BOS        38.1    4.08   100
## 2 SFO       344.    17.7    100

9.1.1 Observations BOS vs SFO

  • Time to SFO is on average far higher than that to BOS
  • The sd is also very informative: 4 miles and 18 miles each.

9.1.2 LC10.1

If the sd(mean$time)| SFO had been say 200, we might have wanted to check if the time difference was really that high. And an SD of 100 would have been just OK, but either way we would have scaled the difference by the sd ( which one?) and decided if the difference was large in z-score terms.

gf_boxplot(air_time ~ dest, data = bos_sfo)

So here clearly there is ** NO NEED** for Hypothesis testing.

As you get more and more practice with hypothesis testing, you’ll be better able to determine in many cases whether or not the results will be statistically significant. There are circumstances where it is difficult to tell, but you should always try to make a guess FIRST about significance after you have completed your data exploration and before you actually begin the inferential techniques.

9.2 Basics of Hypothesis Testing

Based on Allen Downey’s article “There is only one test” http://allendowney.blogspot.com/2016/06/there-is-still-only-one-test.html we have the following diagram:

Hypothesis Testing

Logic of hypothesis testing
- Take a random sample (or samples) from a population (or multiple populations)
- If the sample data are consistent with the null hypothesis, do not reject the null hypothesis.
- If the sample data are inconsistent with the null hypothesis (in the direction of the alternative hypothesis), reject the null hypothesis and conclude that there is evidence the alternative hypothesis is true (based on the particular sample collected).

9.2.1 Hyp Testing with infer

specify() -> hypothesize() -> generate() -> calculate() -> visualise() is the flow for Hypothesis Testing using infer.

9.2.2 Comparing Two Means

Null hypothesis : \(H_0 : \mu_1 - \mu_2 = 0\)
Alternative ( Research ) hypothesis: \(H_a : \mu_1 -\mu2 * 0\) where * = < ≠ or >.

Using ggplot2movies we check if Action movies are rated higher on IMDB than Romance movies.

movies_trimmed <- 
  movies %>% 
  select(title, year, rating, Romance, Action) %>% 
# Romance and Action are binary Yes/No variables.
  filter(!(Action == 1 & Romance == 1)) %>% 
  mutate(genre = case_when(Action == 1 ~ "Action",
                           Romance == 1 ~ "Romance")) %>% 
  filter(!is.na(genre)) %>% 
  select(-Romance, -Action)

movies_trimmed
## # A tibble: 8,878 x 4
##    title                      year rating genre  
##    <chr>                     <int>  <dbl> <chr>  
##  1 $windle                    2002    5.3 Action 
##  2 'A' gai waak               1983    7.1 Action 
##  3 'A' gai waak juk jaap      1987    7.2 Action 
##  4 'Crocodile' Dundee II      1988    5   Action 
##  5 'Gator Bait                1974    3.5 Action 
##  6 'I Know Where I'm Going!'  1945    7.7 Romance
##  7 'M' Word                   1996    4.6 Romance
##  8 'Sheba, Baby'              1975    5.5 Action 
##  9 'Til There Was You         1997    4.8 Romance
## 10 'Til We Meet Again         1940    6.3 Romance
## # … with 8,868 more rows
# Note: This is the POPULATION !!!!
gf_boxplot(rating~genre, data = movies_trimmed)

Ratings for Action movies are more spread out. The median score for Romance is higher, though it also has more outliers at both extremes.

gf_histogram(~ rating | genre ~ ., binwidth = 1, data = movies_trimmed, color = "white")

In both the flights and movies examples, we had access to the population. Normally we have access only to a sample and need to infer population parameters using sample statistics.

9.2.3 So, a sample

Let’s sample 34 each of action and romance movies.

movie_genre_sample <- 
  movies_trimmed %>% 
  group_by(genre) %>% 
  sample_n(34) %>% # 34 of each genre!
  ungroup()

The ungrouping was done so as to allow us to permute the values of rating. Without doing it, the data stays grouped and cannot be randomized.

movie_genre_sample %>% 
  gf_boxplot(rating ~ genre, fill = ~ genre)

movie_genre_sample %>% 
  gf_histogram(~ rating | genre ~.,binwidth = 1,color = "black",fill = ~ genre)

Boxplots show some difference in medians, but the histograms are not clear; the two genres also have different shaped distribution. Let us calculate the stats for the sample:

summary_ratings <- 
  movie_genre_sample %>% 
  group_by(genre) %>% 
  summarise(mean = mean(rating),
            std_dev = sd(rating),
            n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
summary_ratings
## # A tibble: 2 x 4
##   genre    mean std_dev     n
##   <chr>   <dbl>   <dbl> <int>
## 1 Action   5.10    1.59    34
## 2 Romance  6.03    1.15    34

9.2.4 Hypothesis

\(H_0 : \mu_{Romance} - \mu_{Action} = 0 \\\)
\(H_a : \mu_{Romance} - \mu_{Action} ≠ 0\)

Note that Hypothesis is about the population. Always. No sense otherwise ;-D

9.2.5 Test Statistic

We look for a difference in sample means, by group \(\delta = \bar x_{Romance} - \bar x_{Action}\)

9.2.6 Observed Effect

obs_diff <- 
  movie_genre_sample %>% 
  specify(rating ~ genre) %>% 
  calculate(stat = "diff in means", order = c("Romance", "Action"))
obs_diff
## # A tibble: 1 x 1
##    stat
##   <dbl>
## 1 0.935

We need to now simulate and find out if such an obs_diff or higher could occur purely by chance (under \(H_0\)) with high enough probability ( > 0.05).

We take the combined (34 + 34) set of movies, shuffle them (i.e. their genres) and split them into two, at random. If the population means are the same, then the difference in means between these two new subsets must also be very small. We can do this a 1000 times to check.

This is called permutation or randomization.

generated_samples <- 
  movie_genre_sample %>% 
  specify(rating ~ genre) %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 5000,type = "permute") 

null_distribution_two_means <- 
  generated_samples %>% 
  calculate(stat = "diff in means", order = c("Romance", "Action"))

#deprecated command 
null_distribution_two_means %>%  
  visualise(obs_stat = obs_diff,direction = "both",bins = 100)
## Warning: `visualize()` should no longer be used to plot a p-value. Arguments
## `obs_stat`, `obs_stat_color`, `pvalue_fill`, and `direction` are deprecated. Use
## `shade_p_value()` instead.

#latest and greatest command...things change before I can get used to them.;-()
null_distribution_two_means %>%  
  visualise() + shade_p_value(obs_stat = obs_diff,direction = "both")

p_value <- 
  null_distribution_two_means %>% 
  get_p_value(obs_stat = obs_diff,direction = "both")
p_value
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.006

This p-value is very small and less than \(\alpha = 0.05\). This is the probability of obtaining the obs_diff by chance, ASSUMING that Null Hypothesis is true. Hence we need to reject the null hypothesis and state that there is evidence that the mean ratings between Romance and Action movies are indeed different.

In order to generate Confidence Intervals with the single sample we have, we turn to the bootstrap method again:

percentile_ci_two_means <- 
  movie_genre_sample %>% 
  specify(rating ~ genre) %>% 
  # hypothesize(null = "independence") %>%
  generate(reps = 5000, type = "bootstrap") %>% 
  calculate(stat = "diff in means", order = c("Romance", "Action")) %>% 
  get_ci()
percentile_ci_two_means
## # A tibble: 1 x 2
##   `2.5%` `97.5%`
##    <dbl>   <dbl>
## 1  0.244    1.56

Hence the difference in mean rating between Romance and Action movies is between these two values with 95% probability.

9.2.7 Problem LC10.14

Repeat the above inference using median rating instead of mean rating.

movie_genre_sample # same sample as before
## # A tibble: 68 x 4
##    title                      year rating genre 
##    <chr>                     <int>  <dbl> <chr> 
##  1 City Limits                1985    2.8 Action
##  2 Tarzan's Desert Mystery    1943    6.3 Action
##  3 Eagle Island               1986    5.6 Action
##  4 U.S. Seals                 1999    2.4 Action
##  5 7th Voyage of Sinbad, The  1958    6.9 Action
##  6 Sur le seuil               2003    6.6 Action
##  7 Death Wish                 1974    6.5 Action
##  8 Things Are Tough All Over  1982    5.2 Action
##  9 Aashiq                     2001    3.7 Action
## 10 Complex World              1992    6.6 Action
## # … with 58 more rows
# EDA
movie_genre_sample %>% 
  gf_boxplot(rating ~ genre, fill = ~genre)

#Stat
obs_diff <- movie_genre_sample %>% 
  specify(rating ~ genre) %>% 
  calculate(stat = "diff in medians" , order = c("Romance", "Action"))
obs_diff
## # A tibble: 1 x 1
##    stat
##   <dbl>
## 1  1.30
# Generate Samples
samples <- movie_genre_sample %>% 
  specify(rating ~ genre) %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 1000, type = "permute") 

# Null distribution
null_dist <- samples %>% 
  calculate(stat = "diff in medians", order = c("Romance", "Action"))

#Visualise
null_dist %>% visualise()

null_dist %>%  
  visualise() + shade_p_value(obs_stat = obs_diff,direction = "both")

# p_value and ci
p_value <- null_dist %>% 
  get_p_value(obs_stat = obs_diff,direction = "both")
## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.
p_value
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1       0
ci <- 
  movie_genre_sample %>% 
  specify(rating~genre) %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "diff in medians", order = c("Romance", "Action")) %>% 
  get_ci()
ci
## # A tibble: 1 x 2
##   `2.5%` `97.5%`
##    <dbl>   <dbl>
## 1   0.25       2

The difference in medians between the two groups is between 0.5 and 2.2 with 95% probability. The p-value is small enough for us to state that we can reject the null hypothesis and that there is indeed evidence for a difference in medians or rating between the two movie categories.

10 Inference for Regression Coefficients

In regression we compute a slope and intercept for a model between explanatory and response variables.

We can use the idea of permutation/randomization to determine the null distribution and conduct a hypothesis test for the population slope.

We will use the teacher evaluation dataset again.

evals %>% 
  specify(score ~ bty_avg)
## Response: score (numeric)
## Explanatory: bty_avg (numeric)
## # A tibble: 463 x 2
##    score bty_avg
##    <dbl>   <dbl>
##  1   4.7    5   
##  2   4.1    5   
##  3   3.9    5   
##  4   4.8    5   
##  5   4.6    3   
##  6   4.3    3   
##  7   2.8    3   
##  8   4.1    3.33
##  9   3.4    3.33
## 10   4.5    3.17
## # … with 453 more rows
# Observed Statistic
obs_slope <- 
  evals %>% 
  specify(score ~ bty_avg) %>% 
  calculate(stat = "slope")
obs_slope
## # A tibble: 1 x 1
##     stat
##    <dbl>
## 1 0.0666

10.1 Hypothesis

\(H_0: Null: Population\ Slope\ \beta_1 = 0\)
\(H_a: Alternative: Population\ Slope\ \beta_1 > 0\)

10.2 Simulated Data

If β1=0, we said above that there is no relationship between the teaching and beauty scores. If there is no relationship, then any one of the teaching score values could have just as likely occurred with any of the other beauty score values instead of the one that it actually did fall with. We, therefore, have another example of permuting in our simulating of data under the null hypothesis.

Earlier with the movies, we shuffled rating and genre. Here we equivalently shuffle score and bty_avg.

# Distribution *under* H0
null_slope_dist <- 
  evals %>% 
  specify(score ~ bty_avg) %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 10000, type = "permute") %>% 
  calculate(stat = "slope")

null_slope_dist %>% 
  visualise() + 
  shade_p_value(obs_stat = obs_slope,direction = "greater")

p_value <- null_slope_dist %>% 
  get_p_value(obs_stat = obs_slope,direction = "greater")
## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.
p_value
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1       0

Since p-value is very small, it means that the observed value is very improbable and unlikely to have occured merely by chance. Hence we can reject the null hypothesis and state there is a positive association between beauty score and teaching score in this dataset.

We can use bootstrapping to calculate how strong the slope is and with what confidence we can state that. ### Bootstrapping for slope

bootstrap_slope_dist <- evals %>% 
  specify(score ~ bty_avg) %>% 
  generate(reps = 10000, type = "bootstrap") %>% 
  calculate(stat = "slope")

bootstrap_slope_dist %>% 
  visualise() + shade_p_value(obs_stat = obs_slope,direction = "both")

#Percentile Bootstrap Confidence Intervals
percentile_slope_ci <- bootstrap_slope_dist %>% 
  get_ci(level = 0.99,type = "percentile")
percentile_slope_ci
## # A tibble: 1 x 2
##   `0.5%` `99.5%`
##    <dbl>   <dbl>
## 1 0.0225   0.111
# Standard Error Confidence Intervals
se_slope_ci <- bootstrap_slope_dist %>% 
  get_ci(type = "se",point_estimate = obs_slope, level = 0.99)
se_slope_ci
## # A tibble: 1 x 2
##    lower upper
##    <dbl> <dbl>
## 1 0.0223 0.111

Since we have a large number of samples we have symmetric bell-shaped distributions for both bootstrap and se methods and hence the two CIs are very similar.

10.2.1 Problem LC11.1

Repeat the inference using stat = "correlation" in calculate().

# Observed Statistic
obs_cor <- 
  evals %>% 
  specify(score ~ bty_avg) %>% 
  calculate(stat = "correlation")
obs_cor
## # A tibble: 1 x 1
##    stat
##   <dbl>
## 1 0.187
# Null Distribution
null_dist_cor <- 
  evals %>% 
  specify(score ~ bty_avg) %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 10000, type = "permute") %>% 
  calculate(stat = "correlation")

null_dist_cor %>% 
  visualise() + shade_p_value(obs_stat = obs_cor,direction = "greater")

p_value <- 
  null_dist_cor %>% 
  get_p_value(obs_stat = obs_cor,direction = "greater")
## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.
p_value
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1       0

11 CI with bootstrap

bootstrap_cor_dist <- 
  evals %>% 
  specify(score ~ bty_avg) %>% 
  generate(reps = 10000, type = "bootstrap") %>% 
  calculate(stat = "correlation") 

bootstrap_cor_dist %>% 
  visualise() + shade_p_value(obs_stat = obs_cor, direction = "greater")

percentile_ci_cor <- 
  bootstrap_cor_dist %>% 
  get_ci(level = 0.99)
percentile_ci_cor
## # A tibble: 1 x 2
##   `0.5%` `99.5%`
##    <dbl>   <dbl>
## 1 0.0619   0.307
# CI with Standard Error
se_ci_cor <- 
bootstrap_cor_dist %>%  
  get_ci(type = "se", point_estimate = obs_cor)
se_ci_cor
## # A tibble: 1 x 2
##    lower upper
##    <dbl> <dbl>
## 1 0.0947 0.280

11.1 Inference for Multiple Regression

Here the response variable score was regressed against - age a numerical explanatory variable
- gender a categorical explanatory variable

11.1.1 Refresher !

evals_multiple <- 
  evals %>% 
  select(score, bty_avg, age, gender, ethnicity, language, rank)

# Model with no interaction
score_model_1 <- lm(score ~ gender + age, data = evals_multiple)
score_model_1 %>% 
  get_regression_table()
## # A tibble: 3 x 7
##   term       estimate std_error statistic p_value lower_ci upper_ci
##   <chr>         <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept     4.48      0.125     35.8    0        4.24     4.73 
## 2 gendermale    0.191     0.052      3.63   0        0.087    0.294
## 3 age          -0.009     0.003     -3.28   0.001   -0.014   -0.003
# Model with Interaction
score_model_2 <- lm(score ~ gender * age, data = evals_multiple)
score_model_2 %>% 
  get_regression_table()
## # A tibble: 4 x 7
##   term           estimate std_error statistic p_value lower_ci upper_ci
##   <chr>             <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept         4.88      0.205     23.8    0        4.48     5.29 
## 2 gendermale       -0.446     0.265     -1.68   0.094   -0.968    0.076
## 3 age              -0.018     0.004     -3.92   0       -0.026   -0.009
## 4 gendermale:age    0.014     0.006      2.45   0.015    0.003    0.024
#Plots without interaction ( Note the use of `~` everywhere!!) 
#Somewhat clumsy code 
# evals_multiple %>% 
#   gf_point(score ~ age, color = ~ gender, data = evals_multiple, 
#            title = "Plots without Interaction") %>% 
#   gf_abline(data = filter(evals_multiple, gender == "male"), 
#             intercept = ~ 4.484+0.191, slope = ~ -0.009,color = ~ gender) %>% 
#   gf_abline(data = filter(evals_multiple, gender == "female"), 
#             intercept = ~ 4.484, slope = ~ -0.009,color = ~ gender)

# Another more elegant way
library(modelr)
## 
## Attaching package: 'modelr'
## The following object is masked from 'package:broom':
## 
##     bootstrap
## The following object is masked from 'package:ggformula':
## 
##     na.warn
evals_multiple %>% 
     add_predictions(., score_model_1) %>% 
     gf_point(score ~ age, color = ~ gender, title = "Plots without Interaction") %>% 
     gf_line(pred ~ age, color = ~ gender)

# Plots with interaction
# Clumsy Code
# evals_multiple %>% 
#   gf_point(score ~ age, color = ~ gender, data = evals_multiple, title = "Plots with Interaction") %>% 
#   gf_abline( intercept = ~ 4.883 - 0.446,slope =  ~ -0.018+0.014, color = ~gender, data = filter(evals_multiple, gender == "male")) %>% 
#   gf_abline( intercept = ~ 4.883,slope =  ~ -0.018, color = ~gender, data = filter(evals_multiple, gender == "female"))
   
# Another more elegant way
#library(modelr)
evals_multiple %>% 
     add_predictions(., score_model_2) %>% 
     gf_point(score ~ age, color = ~ gender, title = "Plots with Interaction") %>% 
     gf_line(pred ~ age, color = ~ gender)

Gothcha !!!!

11.1.2 Residual Analysis

Recall the residuals can be thought of as the error or the “lack-of-fit” between the observed value y and the fitted value ˆy on the blue regression line in Figure 6.6. Ideally when we fit a regression model, we’d like there to be no systematic pattern to these residuals.

We investigate the residuals in this section. We can also decide based on that analysis if the model we have is too simple.

11.1.2.1 Teacher Scores vs Beauty

evals_ch6 <- 
  evals %>% 
  select(score, bty_avg, age, gender)
score_model <- lm(score ~ bty_avg, data = evals_ch6)

score_model %>% 
  get_regression_table()
## # A tibble: 2 x 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    3.88      0.076     51.0        0    3.73     4.03 
## 2 bty_avg      0.067     0.016      4.09       0    0.035    0.099
reg_points <- 
  score_model %>% 
  get_regression_points()

# Plot residuals vs Explanatory covariate
reg_points %>% 
  gf_point(residual ~ bty_avg, xlab = "Beauty Score") %>% 
  gf_hline(yintercept = ~ 0,color = "blue")

reg_points %>% 
  gf_histogram(~ residual,binwidth = 0.25, color = "white")

mean(reg_points$residual)
## [1] 7.3434e-05

Note that while mean(residuals) is zero, as it should be, there is a negative skew to the residuals. There are many small negative residuals, but relatively fewer, and therefore large positive residuals. So our model tends to underestimate the score those who have high scores.

We would really like our residuals to be normally distributed.

11.1.2.2 Problem LC11.2 Teacher Scores vs Age

Score vs age

score_model_age <- lm(score ~ age,data = evals)

score_model %>% 
  get_regression_table()
## # A tibble: 2 x 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    3.88      0.076     51.0        0    3.73     4.03 
## 2 bty_avg      0.067     0.016      4.09       0    0.035    0.099
reg_points <- score_model %>% 
  get_regression_points() 
reg_points
## # A tibble: 463 x 5
##       ID score bty_avg score_hat residual
##    <int> <dbl>   <dbl>     <dbl>    <dbl>
##  1     1   4.7    5         4.21    0.486
##  2     2   4.1    5         4.21   -0.114
##  3     3   3.9    5         4.21   -0.314
##  4     4   4.8    5         4.21    0.586
##  5     5   4.6    3         4.08    0.52 
##  6     6   4.3    3         4.08    0.22 
##  7     7   2.8    3         4.08   -1.28 
##  8     8   4.1    3.33      4.10   -0.002
##  9     9   3.4    3.33      4.10   -0.702
## 10    10   4.5    3.17      4.09    0.409
## # … with 453 more rows
reg_points%>% 
  gf_histogram(~residual,binwidth = 0.25, color = "white")

There is some negative skew again, as with the bty_avg explanatory variable.

11.1.2.3 Life Expectancy vs Continent with Gapminder Data

library(gapminder)
gapminder2007 <- gapminder %>% filter(year == "2007")
lifeExp_model <- lm(data = gapminder2007, lifeExp ~ continent)

lifeExp_model %>% get_regression_table()
## # A tibble: 5 x 7
##   term              estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept             54.8      1.02     53.4        0     52.8     56.8
## 2 continentAmericas     18.8      1.8      10.4        0     15.2     22.4
## 3 continentAsia         15.9      1.65      9.68       0     12.7     19.2
## 4 continentEurope       22.8      1.70     13.5        0     19.5     26.2
## 5 continentOceania      25.9      5.33      4.86       0     15.4     36.4
reg_points <- lifeExp_model %>% 
  get_regression_points()

reg_points %>% gf_jitter(residual~continent, height = 0.1,width = 0.1)

# Note the outlier in Asia, with very low Life Expectancy. 

reg_points %>% 
  gf_histogram(~residual, binwidth = 5, color = "white")

gapminder2007 %>% filter(continent == "Asia") %>% 
  arrange(lifeExp) %>% 
  gf_col(lifeExp ~ reorder(country, lifeExp),xlab = "Country", ylab = "Life Expectancy (years)") %>% 
  gf_theme(coord_flip()) %>% 
  gf_theme(axis.text.y = 
             element_text(hjust = 1.0, color = "red", size = 6))

11.1.2.4 Problem LC11.3 GDP per Capita vs Continent with Gapminder

  • gdpPercap vs continent with gapminder
  • All countries in 2007
gapminder2007
## # A tibble: 142 x 6
##    country     continent  year lifeExp       pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Afghanistan Asia       2007    43.8  31889923      975.
##  2 Albania     Europe     2007    76.4   3600523     5937.
##  3 Algeria     Africa     2007    72.3  33333216     6223.
##  4 Angola      Africa     2007    42.7  12420476     4797.
##  5 Argentina   Americas   2007    75.3  40301927    12779.
##  6 Australia   Oceania    2007    81.2  20434176    34435.
##  7 Austria     Europe     2007    79.8   8199783    36126.
##  8 Bahrain     Asia       2007    75.6    708573    29796.
##  9 Bangladesh  Asia       2007    64.1 150448339     1391.
## 10 Belgium     Europe     2007    79.4  10392226    33693.
## # … with 132 more rows
gdp_model <- lm(gdpPercap ~ continent, data = gapminder2007)
reg_points <- gdp_model %>% 
  get_regression_points()
reg_points
## # A tibble: 142 x 5
##       ID gdpPercap continent gdpPercap_hat residual
##    <int>     <dbl> <fct>             <dbl>    <dbl>
##  1     1      975. Asia             12473.  -11498.
##  2     2     5937. Europe           25054.  -19117.
##  3     3     6223. Africa            3089.    3134.
##  4     4     4797. Africa            3089.    1708.
##  5     5    12779. Americas         11003.    1776.
##  6     6    34435. Oceania          29810.    4625.
##  7     7    36126. Europe           25054.   11072.
##  8     8    29796. Asia             12473.   17323.
##  9     9     1391. Asia             12473.  -11082.
## 10    10    33693. Europe           25054.    8638.
## # … with 132 more rows
# Plots
gf_point(residual ~ continent, height = 0, width =  0.05,data = reg_points)

gf_histogram(~residual, binwidth = 2000,color = "white",data = reg_points)

gapminder2007 %>% arrange(gdpPercap)
## # A tibble: 142 x 6
##    country                  continent  year lifeExp      pop gdpPercap
##    <fct>                    <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Congo, Dem. Rep.         Africa     2007    46.5 64606759      278.
##  2 Liberia                  Africa     2007    45.7  3193942      415.
##  3 Burundi                  Africa     2007    49.6  8390505      430.
##  4 Zimbabwe                 Africa     2007    43.5 12311143      470.
##  5 Guinea-Bissau            Africa     2007    46.4  1472041      579.
##  6 Niger                    Africa     2007    56.9 12894865      620.
##  7 Eritrea                  Africa     2007    58.0  4906585      641.
##  8 Ethiopia                 Africa     2007    52.9 76511887      691.
##  9 Central African Republic Africa     2007    44.7  4369038      706.
## 10 Gambia                   Africa     2007    59.4  1688359      753.
## # … with 132 more rows

There are a large number of residuals \(y - \hat y\) just on the positive side of zero, which means these have been underestimated by the model. But since the absolute value of the residuals for these points is small, we may say that the model is adequate.

11.1.2.5 Credit Card Balance vs Limit and Income

Credit
##      ID  Income Limit Rating Cards Age Education Gender Student Married
## 1     1  14.891  3606    283     2  34        11   Male      No     Yes
## 2     2 106.025  6645    483     3  82        15 Female     Yes     Yes
## 3     3 104.593  7075    514     4  71        11   Male      No      No
## 4     4 148.924  9504    681     3  36        11 Female      No      No
## 5     5  55.882  4897    357     2  68        16   Male      No     Yes
## 6     6  80.180  8047    569     4  77        10   Male      No      No
## 7     7  20.996  3388    259     2  37        12 Female      No      No
## 8     8  71.408  7114    512     2  87         9   Male      No      No
## 9     9  15.125  3300    266     5  66        13 Female      No      No
## 10   10  71.061  6819    491     3  41        19 Female     Yes     Yes
## 11   11  63.095  8117    589     4  30        14   Male      No     Yes
## 12   12  15.045  1311    138     3  64        16   Male      No      No
## 13   13  80.616  5308    394     1  57         7 Female      No     Yes
## 14   14  43.682  6922    511     1  49         9   Male      No     Yes
## 15   15  19.144  3291    269     2  75        13 Female      No      No
## 16   16  20.089  2525    200     3  57        15 Female      No     Yes
## 17   17  53.598  3714    286     3  73        17 Female      No     Yes
## 18   18  36.496  4378    339     3  69        15 Female      No     Yes
## 19   19  49.570  6384    448     1  28         9 Female      No     Yes
## 20   20  42.079  6626    479     2  44         9   Male      No      No
## 21   21  17.700  2860    235     4  63        16 Female      No      No
## 22   22  37.348  6378    458     1  72        17 Female      No      No
## 23   23  20.103  2631    213     3  61        10   Male      No     Yes
## 24   24  64.027  5179    398     5  48         8   Male      No     Yes
## 25   25  10.742  1757    156     3  57        15 Female      No      No
## 26   26  14.090  4323    326     5  25        16 Female      No     Yes
## 27   27  42.471  3625    289     6  44        12 Female     Yes      No
## 28   28  32.793  4534    333     2  44        16   Male      No      No
## 29   29 186.634 13414    949     2  41        14 Female      No     Yes
## 30   30  26.813  5611    411     4  55        16 Female      No      No
## 31   31  34.142  5666    413     4  47         5 Female      No     Yes
## 32   32  28.941  2733    210     5  43        16   Male      No     Yes
## 33   33 134.181  7838    563     2  48        13 Female      No      No
## 34   34  31.367  1829    162     4  30        10   Male      No     Yes
## 35   35  20.150  2646    199     2  25        14 Female      No     Yes
## 36   36  23.350  2558    220     3  49        12 Female     Yes      No
## 37   37  62.413  6457    455     2  71        11 Female      No     Yes
## 38   38  30.007  6481    462     2  69         9 Female      No     Yes
## 39   39  11.795  3899    300     4  25        10 Female      No      No
## 40   40  13.647  3461    264     4  47        14   Male      No     Yes
## 41   41  34.950  3327    253     3  54        14 Female      No      No
## 42   42 113.659  7659    538     2  66        15   Male     Yes     Yes
## 43   43  44.158  4763    351     2  66        13 Female      No     Yes
## 44   44  36.929  6257    445     1  24        14 Female      No     Yes
## 45   45  31.861  6375    469     3  25        16 Female      No     Yes
## 46   46  77.380  7569    564     3  50        12 Female      No     Yes
## 47   47  19.531  5043    376     2  64        16 Female     Yes     Yes
## 48   48  44.646  4431    320     2  49        15   Male     Yes     Yes
## 49   49  44.522  2252    205     6  72        15   Male      No     Yes
## 50   50  43.479  4569    354     4  49        13   Male     Yes     Yes
## 51   51  36.362  5183    376     3  49        15   Male      No     Yes
## 52   52  39.705  3969    301     2  27        20   Male      No     Yes
## 53   53  44.205  5441    394     1  32        12   Male      No     Yes
## 54   54  16.304  5466    413     4  66        10   Male      No     Yes
## 55   55  15.333  1499    138     2  47         9 Female      No     Yes
## 56   56  32.916  1786    154     2  60         8 Female      No     Yes
## 57   57  57.100  4742    372     7  79        18 Female      No     Yes
## 58   58  76.273  4779    367     4  65        14 Female      No     Yes
## 59   59  10.354  3480    281     2  70        17   Male      No     Yes
## 60   60  51.872  5294    390     4  81        17 Female      No      No
## 61   61  35.510  5198    364     2  35        20 Female      No      No
## 62   62  21.238  3089    254     3  59        10 Female      No      No
## 63   63  30.682  1671    160     2  77         7 Female      No      No
## 64   64  14.132  2998    251     4  75        17   Male      No      No
## 65   65  32.164  2937    223     2  79        15 Female      No     Yes
## 66   66  12.000  4160    320     4  28        14 Female      No     Yes
## 67   67 113.829  9704    694     4  38        13 Female      No     Yes
## 68   68  11.187  5099    380     4  69        16 Female      No      No
## 69   69  27.847  5619    418     2  78        15 Female      No     Yes
## 70   70  49.502  6819    505     4  55        14   Male      No     Yes
## 71   71  24.889  3954    318     4  75        12   Male      No     Yes
## 72   72  58.781  7402    538     2  81        12 Female      No     Yes
## 73   73  22.939  4923    355     1  47        18 Female      No     Yes
## 74   74  23.989  4523    338     4  31        15   Male      No      No
## 75   75  16.103  5390    418     4  45        10 Female      No     Yes
## 76   76  33.017  3180    224     2  28        16   Male      No     Yes
## 77   77  30.622  3293    251     1  68        16   Male     Yes      No
## 78   78  20.936  3254    253     1  30        15 Female      No      No
## 79   79 110.968  6662    468     3  45        11 Female      No     Yes
## 80   80  15.354  2101    171     2  65        14   Male      No      No
## 81   81  27.369  3449    288     3  40         9 Female      No     Yes
## 82   82  53.480  4263    317     1  83        15   Male      No      No
## 83   83  23.672  4433    344     3  63        11   Male      No      No
## 84   84  19.225  1433    122     3  38        14 Female      No      No
## 85   85  43.540  2906    232     4  69        11   Male      No      No
## 86   86 152.298 12066    828     4  41        12 Female      No     Yes
## 87   87  55.367  6340    448     1  33        15   Male      No     Yes
## 88   88  11.741  2271    182     4  59        12 Female      No      No
## 89   89  15.560  4307    352     4  57         8   Male      No     Yes
## 90   90  59.530  7518    543     3  52         9 Female      No      No
## 91   91  20.191  5767    431     4  42        16   Male      No     Yes
## 92   92  48.498  6040    456     3  47        16   Male      No     Yes
## 93   93  30.733  2832    249     4  51        13   Male      No      No
## 94   94  16.479  5435    388     2  26        16   Male      No      No
## 95   95  38.009  3075    245     3  45        15 Female      No      No
## 96   96  14.084   855    120     5  46        17 Female      No     Yes
## 97   97  14.312  5382    367     1  59        17   Male     Yes      No
## 98   98  26.067  3388    266     4  74        17 Female      No     Yes
## 99   99  36.295  2963    241     2  68        14 Female     Yes      No
## 100 100  83.851  8494    607     5  47        18   Male      No      No
## 101 101  21.153  3736    256     1  41        11   Male      No      No
## 102 102  17.976  2433    190     3  70        16 Female     Yes      No
## 103 103  68.713  7582    531     2  56        16   Male     Yes      No
## 104 104 146.183  9540    682     6  66        15   Male      No      No
## 105 105  15.846  4768    365     4  53        12 Female      No      No
## 106 106  12.031  3182    259     2  58        18 Female      No     Yes
## 107 107  16.819  1337    115     2  74        15   Male      No     Yes
## 108 108  39.110  3189    263     3  72        12   Male      No      No
## 109 109 107.986  6033    449     4  64        14   Male      No     Yes
## 110 110  13.561  3261    279     5  37        19   Male      No     Yes
## 111 111  34.537  3271    250     3  57        17 Female      No     Yes
## 112 112  28.575  2959    231     2  60        11 Female      No      No
## 113 113  46.007  6637    491     4  42        14   Male      No     Yes
## 114 114  69.251  6386    474     4  30        12 Female      No     Yes
## 115 115  16.482  3326    268     4  41        15   Male      No      No
## 116 116  40.442  4828    369     5  81         8 Female      No      No
## 117 117  35.177  2117    186     3  62        16 Female      No      No
## 118 118  91.362  9113    626     1  47        17   Male      No     Yes
## 119 119  27.039  2161    173     3  40        17 Female      No      No
## 120 120  23.012  1410    137     3  81        16   Male      No      No
## 121 121  27.241  1402    128     2  67        15 Female      No     Yes
## 122 122 148.080  8157    599     2  83        13   Male      No     Yes
## 123 123  62.602  7056    481     1  84        11 Female      No      No
## 124 124  11.808  1300    117     3  77        14 Female      No      No
## 125 125  29.564  2529    192     1  30        12 Female      No     Yes
## 126 126  27.578  2531    195     1  34        15 Female      No     Yes
## 127 127  26.427  5533    433     5  50        15 Female     Yes     Yes
## 128 128  57.202  3411    259     3  72        11 Female      No      No
## 129 129 123.299  8376    610     2  89        17   Male     Yes      No
## 130 130  18.145  3461    279     3  56        15   Male      No     Yes
## 131 131  23.793  3821    281     4  56        12 Female     Yes     Yes
## 132 132  10.726  1568    162     5  46        19   Male      No     Yes
## 133 133  23.283  5443    407     4  49        13   Male      No     Yes
## 134 134  21.455  5829    427     4  80        12 Female      No     Yes
## 135 135  34.664  5835    452     3  77        15 Female      No     Yes
## 136 136  44.473  3500    257     3  81        16 Female      No      No
## 137 137  54.663  4116    314     2  70         8 Female      No      No
## 138 138  36.355  3613    278     4  35         9   Male      No     Yes
## 139 139  21.374  2073    175     2  74        11 Female      No     Yes
## 140 140 107.841 10384    728     3  87         7   Male      No      No
## 141 141  39.831  6045    459     3  32        12 Female     Yes     Yes
## 142 142  91.876  6754    483     2  33        10   Male      No     Yes
## 143 143 103.893  7416    549     3  84        17   Male      No      No
## 144 144  19.636  4896    387     3  64        10 Female      No      No
## 145 145  17.392  2748    228     3  32        14   Male      No     Yes
## 146 146  19.529  4673    341     2  51        14   Male      No      No
## 147 147  17.055  5110    371     3  55        15 Female      No     Yes
## 148 148  23.857  1501    150     3  56        16   Male      No     Yes
## 149 149  15.184  2420    192     2  69        11 Female      No     Yes
## 150 150  13.444   886    121     5  44        10   Male      No     Yes
## 151 151  63.931  5728    435     3  28        14 Female      No     Yes
## 152 152  35.864  4831    353     3  66        13 Female      No     Yes
## 153 153  41.419  2120    184     4  24        11 Female     Yes      No
## 154 154  92.112  4612    344     3  32        17   Male      No      No
## 155 155  55.056  3155    235     2  31        16   Male      No     Yes
## 156 156  19.537  1362    143     4  34         9 Female      No     Yes
## 157 157  31.811  4284    338     5  75        13 Female      No     Yes
## 158 158  56.256  5521    406     2  72        16 Female     Yes     Yes
## 159 159  42.357  5550    406     2  83        12 Female      No     Yes
## 160 160  53.319  3000    235     3  53        13   Male      No      No
## 161 161  12.238  4865    381     5  67        11 Female      No      No
## 162 162  31.353  1705    160     3  81        14   Male      No     Yes
## 163 163  63.809  7530    515     1  56        12   Male      No     Yes
## 164 164  13.676  2330    203     5  80        16 Female      No      No
## 165 165  76.782  5977    429     4  44        12   Male      No     Yes
## 166 166  25.383  4527    367     4  46        11   Male      No     Yes
## 167 167  35.691  2880    214     2  35        15   Male      No      No
## 168 168  29.403  2327    178     1  37        14 Female      No     Yes
## 169 169  27.470  2820    219     1  32        11 Female      No     Yes
## 170 170  27.330  6179    459     4  36        12 Female      No     Yes
## 171 171  34.772  2021    167     3  57         9   Male      No      No
## 172 172  36.934  4270    299     1  63         9 Female      No     Yes
## 173 173  76.348  4697    344     4  60        18   Male      No      No
## 174 174  14.887  4745    339     3  58        12   Male      No     Yes
## 175 175 121.834 10673    750     3  54        16   Male      No      No
## 176 176  30.132  2168    206     3  52        17   Male      No      No
## 177 177  24.050  2607    221     4  32        18   Male      No     Yes
## 178 178  22.379  3965    292     2  34        14 Female      No     Yes
## 179 179  28.316  4391    316     2  29        10 Female      No      No
## 180 180  58.026  7499    560     5  67        11 Female      No      No
## 181 181  10.635  3584    294     5  69        16   Male      No     Yes
## 182 182  46.102  5180    382     3  81        12   Male      No     Yes
## 183 183  58.929  6420    459     2  66         9 Female      No     Yes
## 184 184  80.861  4090    335     3  29        15 Female      No     Yes
## 185 185 158.889 11589    805     1  62        17 Female      No     Yes
## 186 186  30.420  4442    316     1  30        14 Female      No      No
## 187 187  36.472  3806    309     2  52        13   Male      No      No
## 188 188  23.365  2179    167     2  75        15   Male      No      No
## 189 189  83.869  7667    554     2  83        11   Male      No      No
## 190 190  58.351  4411    326     2  85        16 Female      No     Yes
## 191 191  55.187  5352    385     4  50        17 Female      No     Yes
## 192 192 124.290  9560    701     3  52        17 Female     Yes      No
## 193 193  28.508  3933    287     4  56        14   Male      No     Yes
## 194 194 130.209 10088    730     7  39        19 Female      No     Yes
## 195 195  30.406  2120    181     2  79        14   Male      No     Yes
## 196 196  23.883  5384    398     2  73        16 Female      No     Yes
## 197 197  93.039  7398    517     1  67        12   Male      No     Yes
## 198 198  50.699  3977    304     2  84        17 Female      No      No
## 199 199  27.349  2000    169     4  51        16 Female      No     Yes
## 200 200  10.403  4159    310     3  43         7   Male      No     Yes
## 201 201  23.949  5343    383     2  40        18   Male      No     Yes
## 202 202  73.914  7333    529     6  67        15 Female      No     Yes
## 203 203  21.038  1448    145     2  58        13 Female      No     Yes
## 204 204  68.206  6784    499     5  40        16 Female     Yes      No
## 205 205  57.337  5310    392     2  45         7 Female      No      No
## 206 206  10.793  3878    321     8  29        13   Male      No      No
## 207 207  23.450  2450    180     2  78        13   Male      No      No
## 208 208  10.842  4391    358     5  37        10 Female     Yes     Yes
## 209 209  51.345  4327    320     3  46        15   Male      No      No
## 210 210 151.947  9156    642     2  91        11 Female      No     Yes
## 211 211  24.543  3206    243     2  62        12 Female      No     Yes
## 212 212  29.567  5309    397     3  25        15   Male      No      No
## 213 213  39.145  4351    323     2  66        13   Male      No     Yes
## 214 214  39.422  5245    383     2  44        19   Male      No      No
## 215 215  34.909  5289    410     2  62        16 Female      No     Yes
## 216 216  41.025  4229    337     3  79        19 Female      No     Yes
## 217 217  15.476  2762    215     3  60        18   Male      No      No
## 218 218  12.456  5395    392     3  65        14   Male      No     Yes
## 219 219  10.627  1647    149     2  71        10 Female     Yes     Yes
## 220 220  38.954  5222    370     4  76        13 Female      No      No
## 221 221  44.847  5765    437     3  53        13 Female     Yes      No
## 222 222  98.515  8760    633     5  78        11 Female      No      No
## 223 223  33.437  6207    451     4  44         9   Male     Yes      No
## 224 224  27.512  4613    344     5  72        17   Male      No     Yes
## 225 225 121.709  7818    584     4  50         6   Male      No     Yes
## 226 226  15.079  5673    411     4  28        15 Female      No     Yes
## 227 227  59.879  6906    527     6  78        15 Female      No      No
## 228 228  66.989  5614    430     3  47        14 Female      No     Yes
## 229 229  69.165  4668    341     2  34        11 Female      No      No
## 230 230  69.943  7555    547     3  76         9   Male      No     Yes
## 231 231  33.214  5137    387     3  59         9   Male      No      No
## 232 232  25.124  4776    378     4  29        12   Male      No     Yes
## 233 233  15.741  4788    360     1  39        14   Male      No     Yes
## 234 234  11.603  2278    187     3  71        11   Male      No     Yes
## 235 235  69.656  8244    579     3  41        14   Male      No     Yes
## 236 236  10.503  2923    232     3  25        18 Female      No     Yes
## 237 237  42.529  4986    369     2  37        11   Male      No     Yes
## 238 238  60.579  5149    388     5  38        15   Male      No     Yes
## 239 239  26.532  2910    236     6  58        19 Female      No     Yes
## 240 240  27.952  3557    263     1  35        13 Female      No     Yes
## 241 241  29.705  3351    262     5  71        14 Female      No     Yes
## 242 242  15.602   906    103     2  36        11   Male      No     Yes
## 243 243  20.918  1233    128     3  47        18 Female     Yes     Yes
## 244 244  58.165  6617    460     1  56        12 Female      No     Yes
## 245 245  22.561  1787    147     4  66        15 Female      No      No
## 246 246  34.509  2001    189     5  80        18 Female      No     Yes
## 247 247  19.588  3211    265     4  59        14 Female      No      No
## 248 248  36.364  2220    188     3  50        19   Male      No      No
## 249 249  15.717   905     93     1  38        16   Male     Yes     Yes
## 250 250  22.574  1551    134     3  43        13 Female     Yes     Yes
## 251 251  10.363  2430    191     2  47        18 Female      No     Yes
## 252 252  28.474  3202    267     5  66        12   Male      No     Yes
## 253 253  72.945  8603    621     3  64         8 Female      No      No
## 254 254  85.425  5182    402     6  60        12   Male      No     Yes
## 255 255  36.508  6386    469     4  79         6 Female      No     Yes
## 256 256  58.063  4221    304     3  50         8   Male      No      No
## 257 257  25.936  1774    135     2  71        14 Female      No      No
## 258 258  15.629  2493    186     1  60        14   Male      No     Yes
## 259 259  41.400  2561    215     2  36        14   Male      No     Yes
## 260 260  33.657  6196    450     6  55         9 Female      No      No
## 261 261  67.937  5184    383     4  63        12   Male      No     Yes
## 262 262 180.379  9310    665     3  67         8 Female     Yes     Yes
## 263 263  10.588  4049    296     1  66        13 Female      No     Yes
## 264 264  29.725  3536    270     2  52        15 Female      No      No
## 265 265  27.999  5107    380     1  55        10   Male      No     Yes
## 266 266  40.885  5013    379     3  46        13 Female      No     Yes
## 267 267  88.830  4952    360     4  86        16 Female      No     Yes
## 268 268  29.638  5833    433     3  29        15 Female      No     Yes
## 269 269  25.988  1349    142     4  82        12   Male      No      No
## 270 270  39.055  5565    410     4  48        18 Female      No     Yes
## 271 271  15.866  3085    217     1  39        13   Male      No      No
## 272 272  44.978  4866    347     1  30        10 Female      No      No
## 273 273  30.413  3690    299     2  25        15 Female     Yes      No
## 274 274  16.751  4706    353     6  48        14   Male     Yes      No
## 275 275  30.550  5869    439     5  81         9 Female      No      No
## 276 276 163.329  8732    636     3  50        14   Male      No     Yes
## 277 277  23.106  3476    257     2  50        15 Female      No      No
## 278 278  41.532  5000    353     2  50        12   Male      No     Yes
## 279 279 128.040  6982    518     2  78        11 Female      No     Yes
## 280 280  54.319  3063    248     3  59         8 Female     Yes      No
## 281 281  53.401  5319    377     3  35        12 Female      No      No
## 282 282  36.142  1852    183     3  33        13 Female      No      No
## 283 283  63.534  8100    581     2  50        17 Female      No     Yes
## 284 284  49.927  6396    485     3  75        17 Female      No     Yes
## 285 285  14.711  2047    167     2  67         6   Male      No     Yes
## 286 286  18.967  1626    156     2  41        11 Female      No     Yes
## 287 287  18.036  1552    142     2  48        15 Female      No      No
## 288 288  60.449  3098    272     4  69         8   Male      No     Yes
## 289 289  16.711  5274    387     3  42        16 Female      No     Yes
## 290 290  10.852  3907    296     2  30         9   Male      No      No
## 291 291  26.370  3235    268     5  78        11   Male      No     Yes
## 292 292  24.088  3665    287     4  56        13 Female      No     Yes
## 293 293  51.532  5096    380     2  31        15   Male      No     Yes
## 294 294 140.672 11200    817     7  46         9   Male      No     Yes
## 295 295  42.915  2532    205     4  42        13   Male      No     Yes
## 296 296  27.272  1389    149     5  67        10 Female      No     Yes
## 297 297  65.896  5140    370     1  49        17 Female      No     Yes
## 298 298  55.054  4381    321     3  74        17   Male      No     Yes
## 299 299  20.791  2672    204     1  70        18 Female      No      No
## 300 300  24.919  5051    372     3  76        11 Female      No     Yes
## 301 301  21.786  4632    355     1  50        17   Male      No     Yes
## 302 302  31.335  3526    289     3  38         7 Female      No      No
## 303 303  59.855  4964    365     1  46        13 Female      No     Yes
## 304 304  44.061  4970    352     1  79        11   Male      No     Yes
## 305 305  82.706  7506    536     2  64        13 Female      No     Yes
## 306 306  24.460  1924    165     2  50        14 Female      No     Yes
## 307 307  45.120  3762    287     3  80         8   Male      No     Yes
## 308 308  75.406  3874    298     3  41        14 Female      No     Yes
## 309 309  14.956  4640    332     2  33         6   Male      No      No
## 310 310  75.257  7010    494     3  34        18 Female      No     Yes
## 311 311  33.694  4891    369     1  52        16   Male     Yes      No
## 312 312  23.375  5429    396     3  57        15 Female      No     Yes
## 313 313  27.825  5227    386     6  63        11   Male      No     Yes
## 314 314  92.386  7685    534     2  75        18 Female      No     Yes
## 315 315 115.520  9272    656     2  69        14   Male      No      No
## 316 316  14.479  3907    296     3  43        16   Male      No     Yes
## 317 317  52.179  7306    522     2  57        14   Male      No      No
## 318 318  68.462  4712    340     2  71        16   Male      No     Yes
## 319 319  18.951  1485    129     3  82        13 Female      No      No
## 320 320  27.590  2586    229     5  54        16   Male      No     Yes
## 321 321  16.279  1160    126     3  78        13   Male     Yes     Yes
## 322 322  25.078  3096    236     2  27        15 Female      No     Yes
## 323 323  27.229  3484    282     6  51        11   Male      No      No
## 324 324 182.728 13913    982     4  98        17   Male      No     Yes
## 325 325  31.029  2863    223     2  66        17   Male     Yes     Yes
## 326 326  17.765  5072    364     1  66        12 Female      No     Yes
## 327 327 125.480 10230    721     3  82        16   Male      No     Yes
## 328 328  49.166  6662    508     3  68        14 Female      No      No
## 329 329  41.192  3673    297     3  54        16 Female      No     Yes
## 330 330  94.193  7576    527     2  44        16 Female      No     Yes
## 331 331  20.405  4543    329     2  72        17   Male     Yes      No
## 332 332  12.581  3976    291     2  48        16   Male      No     Yes
## 333 333  62.328  5228    377     3  83        15   Male      No      No
## 334 334  21.011  3402    261     2  68        17   Male      No     Yes
## 335 335  24.230  4756    351     2  64        15 Female      No     Yes
## 336 336  24.314  3409    270     2  23         7 Female      No     Yes
## 337 337  32.856  5884    438     4  68        13   Male      No      No
## 338 338  12.414   855    119     3  32        12   Male      No     Yes
## 339 339  41.365  5303    377     1  45        14   Male      No      No
## 340 340 149.316 10278    707     1  80        16   Male      No      No
## 341 341  27.794  3807    301     4  35         8 Female      No     Yes
## 342 342  13.234  3922    299     2  77        17 Female      No     Yes
## 343 343  14.595  2955    260     5  37         9   Male      No     Yes
## 344 344  10.735  3746    280     2  44        17 Female      No     Yes
## 345 345  48.218  5199    401     7  39        10   Male      No     Yes
## 346 346  30.012  1511    137     2  33        17   Male      No     Yes
## 347 347  21.551  5380    420     5  51        18   Male      No     Yes
## 348 348 160.231 10748    754     2  69        17   Male      No      No
## 349 349  13.433  1134    112     3  70        14   Male      No     Yes
## 350 350  48.577  5145    389     3  71        13 Female      No     Yes
## 351 351  30.002  1561    155     4  70        13 Female      No     Yes
## 352 352  61.620  5140    374     1  71         9   Male      No     Yes
## 353 353 104.483  7140    507     2  41        14   Male      No     Yes
## 354 354  41.868  4716    342     2  47        18   Male      No      No
## 355 355  12.068  3873    292     1  44        18 Female      No     Yes
## 356 356 180.682 11966    832     2  58         8 Female      No     Yes
## 357 357  34.480  6090    442     3  36        14   Male      No      No
## 358 358  39.609  2539    188     1  40        14   Male      No     Yes
## 359 359  30.111  4336    339     1  81        18   Male      No     Yes
## 360 360  12.335  4471    344     3  79        12   Male      No     Yes
## 361 361  53.566  5891    434     4  82        10 Female      No      No
## 362 362  53.217  4943    362     2  46        16 Female      No     Yes
## 363 363  26.162  5101    382     3  62        19 Female      No      No
## 364 364  64.173  6127    433     1  80        10   Male      No     Yes
## 365 365 128.669  9824    685     3  67        16   Male      No     Yes
## 366 366 113.772  6442    489     4  69        15   Male     Yes     Yes
## 367 367  61.069  7871    564     3  56        14   Male      No     Yes
## 368 368  23.793  3615    263     2  70        14   Male      No      No
## 369 369  89.000  5759    440     3  37         6 Female      No      No
## 370 370  71.682  8028    599     3  57        16   Male      No     Yes
## 371 371  35.610  6135    466     4  40        12   Male      No      No
## 372 372  39.116  2150    173     4  75        15   Male      No      No
## 373 373  19.782  3782    293     2  46        16 Female     Yes      No
## 374 374  55.412  5354    383     2  37        16 Female     Yes     Yes
## 375 375  29.400  4840    368     3  76        18 Female      No     Yes
## 376 376  20.974  5673    413     5  44        16 Female      No     Yes
## 377 377  87.625  7167    515     2  46        10 Female      No      No
## 378 378  28.144  1567    142     3  51        10   Male      No     Yes
## 379 379  19.349  4941    366     1  33        19   Male      No     Yes
## 380 380  53.308  2860    214     1  84        10   Male      No     Yes
## 381 381 115.123  7760    538     3  83        14 Female      No      No
## 382 382 101.788  8029    574     2  84        11   Male      No     Yes
## 383 383  24.824  5495    409     1  33         9   Male     Yes      No
## 384 384  14.292  3274    282     9  64         9   Male      No     Yes
## 385 385  20.088  1870    180     3  76        16   Male      No      No
## 386 386  26.400  5640    398     3  58        15 Female      No      No
## 387 387  19.253  3683    287     4  57        10   Male      No      No
## 388 388  16.529  1357    126     3  62         9   Male      No      No
## 389 389  37.878  6827    482     2  80        13 Female      No      No
## 390 390  83.948  7100    503     2  44        18   Male      No      No
## 391 391 135.118 10578    747     3  81        15 Female      No     Yes
## 392 392  73.327  6555    472     2  43        15 Female      No      No
## 393 393  25.974  2308    196     2  24        10   Male      No      No
## 394 394  17.316  1335    138     2  65        13   Male      No      No
## 395 395  49.794  5758    410     4  40         8   Male      No      No
## 396 396  12.096  4100    307     3  32        13   Male      No     Yes
## 397 397  13.364  3838    296     5  65        17   Male      No      No
## 398 398  57.872  4171    321     5  67        12 Female      No     Yes
## 399 399  37.728  2525    192     1  44        13   Male      No     Yes
## 400 400  18.701  5524    415     5  64         7 Female      No      No
##            Ethnicity Balance        Type
## 1          Caucasian     333  Medium_low
## 2              Asian     903        High
## 3              Asian     580        High
## 4              Asian     964        High
## 5          Caucasian     331 Medium-high
## 6          Caucasian    1151        High
## 7   African American     203  Medium_low
## 8              Asian     872        High
## 9          Caucasian     279  Medium_low
## 10  African American    1350        High
## 11         Caucasian    1407        High
## 12         Caucasian       0         Low
## 13             Asian     204 Medium-high
## 14         Caucasian    1081        High
## 15  African American     148  Medium_low
## 16  African American       0         Low
## 17  African American       0  Medium_low
## 18             Asian     368  Medium_low
## 19             Asian     891        High
## 20             Asian    1048        High
## 21             Asian      89         Low
## 22         Caucasian     968        High
## 23  African American       0         Low
## 24  African American     411 Medium-high
## 25         Caucasian       0         Low
## 26  African American     671  Medium_low
## 27         Caucasian     654  Medium_low
## 28  African American     467  Medium_low
## 29  African American    1809        High
## 30         Caucasian     915 Medium-high
## 31         Caucasian     863 Medium-high
## 32             Asian       0         Low
## 33         Caucasian     526        High
## 34         Caucasian       0         Low
## 35             Asian       0         Low
## 36         Caucasian     419         Low
## 37         Caucasian     762        High
## 38         Caucasian    1093        High
## 39         Caucasian     531  Medium_low
## 40         Caucasian     344  Medium_low
## 41  African American      50  Medium_low
## 42  African American    1155        High
## 43             Asian     385 Medium-high
## 44             Asian     976        High
## 45         Caucasian    1120        High
## 46         Caucasian     997        High
## 47             Asian    1241 Medium-high
## 48         Caucasian     797  Medium_low
## 49             Asian       0         Low
## 50  African American     902  Medium_low
## 51  African American     654 Medium-high
## 52  African American     211  Medium_low
## 53         Caucasian     607 Medium-high
## 54             Asian     957 Medium-high
## 55             Asian       0         Low
## 56             Asian       0         Low
## 57             Asian     379 Medium-high
## 58         Caucasian     133 Medium-high
## 59         Caucasian     333  Medium_low
## 60         Caucasian     531 Medium-high
## 61             Asian     631 Medium-high
## 62         Caucasian     108  Medium_low
## 63         Caucasian       0         Low
## 64         Caucasian     133         Low
## 65  African American       0         Low
## 66         Caucasian     602  Medium_low
## 67             Asian    1388        High
## 68  African American     889 Medium-high
## 69         Caucasian     822 Medium-high
## 70         Caucasian    1084        High
## 71         Caucasian     357  Medium_low
## 72             Asian    1103        High
## 73             Asian     663 Medium-high
## 74         Caucasian     601  Medium_low
## 75         Caucasian     945 Medium-high
## 76  African American      29  Medium_low
## 77         Caucasian     532  Medium_low
## 78             Asian     145  Medium_low
## 79         Caucasian     391        High
## 80             Asian       0         Low
## 81         Caucasian     162  Medium_low
## 82         Caucasian      99  Medium_low
## 83         Caucasian     503  Medium_low
## 84         Caucasian       0         Low
## 85         Caucasian       0         Low
## 86             Asian    1779        High
## 87         Caucasian     815        High
## 88             Asian       0         Low
## 89  African American     579  Medium_low
## 90  African American    1176        High
## 91  African American    1023 Medium-high
## 92         Caucasian     812        High
## 93         Caucasian       0         Low
## 94  African American     937 Medium-high
## 95  African American       0         Low
## 96  African American       0         Low
## 97             Asian    1380 Medium-high
## 98  African American     155  Medium_low
## 99  African American     375         Low
## 100        Caucasian    1311        High
## 101        Caucasian     298  Medium_low
## 102        Caucasian     431         Low
## 103        Caucasian    1587        High
## 104        Caucasian    1050        High
## 105        Caucasian     745 Medium-high
## 106        Caucasian     210  Medium_low
## 107            Asian       0         Low
## 108            Asian       0  Medium_low
## 109        Caucasian     227        High
## 110            Asian     297  Medium_low
## 111            Asian      47  Medium_low
## 112 African American       0         Low
## 113        Caucasian    1046        High
## 114            Asian     768        High
## 115        Caucasian     271  Medium_low
## 116 African American     510 Medium-high
## 117        Caucasian       0         Low
## 118            Asian    1341        High
## 119        Caucasian       0         Low
## 120        Caucasian       0         Low
## 121            Asian       0         Low
## 122        Caucasian     454        High
## 123        Caucasian     904        High
## 124 African American       0         Low
## 125        Caucasian       0         Low
## 126        Caucasian       0         Low
## 127            Asian    1404 Medium-high
## 128        Caucasian       0  Medium_low
## 129 African American    1259        High
## 130 African American     255  Medium_low
## 131 African American     868  Medium_low
## 132            Asian       0         Low
## 133 African American     912 Medium-high
## 134 African American    1018 Medium-high
## 135 African American     835 Medium-high
## 136 African American       8  Medium_low
## 137 African American      75  Medium_low
## 138            Asian     187  Medium_low
## 139        Caucasian       0         Low
## 140 African American    1597        High
## 141 African American    1425        High
## 142        Caucasian     605        High
## 143            Asian     669        High
## 144 African American     710 Medium-high
## 145        Caucasian      68         Low
## 146            Asian     642 Medium-high
## 147        Caucasian     805 Medium-high
## 148        Caucasian       0         Low
## 149        Caucasian       0         Low
## 150            Asian       0         Low
## 151 African American     581 Medium-high
## 152        Caucasian     534 Medium-high
## 153        Caucasian     156         Low
## 154        Caucasian       0  Medium_low
## 155 African American       0  Medium_low
## 156            Asian       0         Low
## 157        Caucasian     429  Medium_low
## 158        Caucasian    1020 Medium-high
## 159            Asian     653 Medium-high
## 160            Asian       0         Low
## 161        Caucasian     836 Medium-high
## 162        Caucasian       0         Low
## 163        Caucasian    1086        High
## 164 African American       0         Low
## 165            Asian     548        High
## 166        Caucasian     570  Medium_low
## 167 African American       0         Low
## 168        Caucasian       0         Low
## 169            Asian       0         Low
## 170        Caucasian    1099        High
## 171            Asian       0         Low
## 172        Caucasian     283  Medium_low
## 173            Asian     108 Medium-high
## 174 African American     724 Medium-high
## 175 African American    1573        High
## 176        Caucasian       0         Low
## 177        Caucasian       0         Low
## 178            Asian     384  Medium_low
## 179        Caucasian     453  Medium_low
## 180        Caucasian    1237        High
## 181            Asian     423  Medium_low
## 182 African American     516 Medium-high
## 183 African American     789        High
## 184            Asian       0  Medium_low
## 185        Caucasian    1448        High
## 186 African American     450  Medium_low
## 187 African American     188  Medium_low
## 188            Asian       0         Low
## 189 African American     930        High
## 190        Caucasian     126  Medium_low
## 191        Caucasian     538 Medium-high
## 192            Asian    1687        High
## 193            Asian     336  Medium_low
## 194        Caucasian    1426        High
## 195 African American       0         Low
## 196 African American     802 Medium-high
## 197 African American     749        High
## 198 African American      69  Medium_low
## 199 African American       0         Low
## 200            Asian     571  Medium_low
## 201 African American     829 Medium-high
## 202        Caucasian    1048        High
## 203        Caucasian       0         Low
## 204 African American    1411        High
## 205        Caucasian     456 Medium-high
## 206        Caucasian     638  Medium_low
## 207        Caucasian       0         Low
## 208        Caucasian    1216  Medium_low
## 209 African American     230  Medium_low
## 210 African American     732        High
## 211        Caucasian      95  Medium_low
## 212        Caucasian     799 Medium-high
## 213        Caucasian     308  Medium_low
## 214 African American     637 Medium-high
## 215        Caucasian     681 Medium-high
## 216        Caucasian     246  Medium_low
## 217            Asian      52         Low
## 218        Caucasian     955 Medium-high
## 219            Asian     195         Low
## 220        Caucasian     653 Medium-high
## 221            Asian    1246 Medium-high
## 222 African American    1230        High
## 223        Caucasian    1549        High
## 224            Asian     573  Medium_low
## 225        Caucasian     701        High
## 226            Asian    1075 Medium-high
## 227        Caucasian    1032        High
## 228        Caucasian     482 Medium-high
## 229 African American     156 Medium-high
## 230            Asian    1058        High
## 231 African American     661 Medium-high
## 232        Caucasian     657 Medium-high
## 233            Asian     689 Medium-high
## 234        Caucasian       0         Low
## 235 African American    1329        High
## 236 African American     191         Low
## 237            Asian     489 Medium-high
## 238            Asian     443 Medium-high
## 239        Caucasian      52         Low
## 240            Asian     163  Medium_low
## 241            Asian     148  Medium_low
## 242 African American       0         Low
## 243            Asian      16         Low
## 244        Caucasian     856        High
## 245        Caucasian       0         Low
## 246 African American       0         Low
## 247            Asian     199  Medium_low
## 248        Caucasian       0         Low
## 249        Caucasian       0         Low
## 250        Caucasian      98         Low
## 251            Asian       0         Low
## 252        Caucasian     132  Medium_low
## 253        Caucasian    1355        High
## 254 African American     218 Medium-high
## 255        Caucasian    1048        High
## 256 African American     118  Medium_low
## 257            Asian       0         Low
## 258            Asian       0         Low
## 259        Caucasian       0         Low
## 260        Caucasian    1092        High
## 261            Asian     345 Medium-high
## 262            Asian    1050        High
## 263        Caucasian     465  Medium_low
## 264 African American     133  Medium_low
## 265        Caucasian     651 Medium-high
## 266 African American     549 Medium-high
## 267        Caucasian      15 Medium-high
## 268            Asian     942 Medium-high
## 269        Caucasian       0         Low
## 270        Caucasian     772 Medium-high
## 271        Caucasian     136         Low
## 272        Caucasian     436 Medium-high
## 273            Asian     728  Medium_low
## 274            Asian    1255 Medium-high
## 275 African American     967 Medium-high
## 276        Caucasian     529        High
## 277        Caucasian     209  Medium_low
## 278        Caucasian     531 Medium-high
## 279        Caucasian     250        High
## 280        Caucasian     269         Low
## 281 African American     541 Medium-high
## 282 African American       0         Low
## 283        Caucasian    1298        High
## 284        Caucasian     890        High
## 285        Caucasian       0         Low
## 286            Asian       0         Low
## 287        Caucasian       0         Low
## 288        Caucasian       0  Medium_low
## 289            Asian     863 Medium-high
## 290        Caucasian     485  Medium_low
## 291            Asian     159  Medium_low
## 292        Caucasian     309  Medium_low
## 293        Caucasian     481 Medium-high
## 294 African American    1677        High
## 295            Asian       0         Low
## 296        Caucasian       0         Low
## 297        Caucasian     293 Medium-high
## 298            Asian     188  Medium_low
## 299 African American       0         Low
## 300 African American     711 Medium-high
## 301        Caucasian     580 Medium-high
## 302        Caucasian     172  Medium_low
## 303        Caucasian     295 Medium-high
## 304 African American     414 Medium-high
## 305            Asian     905        High
## 306            Asian       0         Low
## 307        Caucasian      70  Medium_low
## 308            Asian       0  Medium_low
## 309            Asian     681 Medium-high
## 310        Caucasian     885        High
## 311 African American    1036 Medium-high
## 312        Caucasian     844 Medium-high
## 313        Caucasian     823 Medium-high
## 314            Asian     843        High
## 315 African American    1140        High
## 316        Caucasian     463  Medium_low
## 317            Asian    1142        High
## 318        Caucasian     136 Medium-high
## 319        Caucasian       0         Low
## 320 African American       0         Low
## 321 African American       5         Low
## 322        Caucasian      81  Medium_low
## 323        Caucasian     265  Medium_low
## 324        Caucasian    1999        <NA>
## 325            Asian     415         Low
## 326        Caucasian     732 Medium-high
## 327        Caucasian    1361        High
## 328            Asian     984        High
## 329        Caucasian     121  Medium_low
## 330        Caucasian     846        High
## 331            Asian    1054  Medium_low
## 332        Caucasian     474  Medium_low
## 333        Caucasian     380 Medium-high
## 334 African American     182  Medium_low
## 335        Caucasian     594 Medium-high
## 336        Caucasian     194  Medium_low
## 337        Caucasian     926        High
## 338 African American       0         Low
## 339        Caucasian     606 Medium-high
## 340 African American    1107        High
## 341 African American     320  Medium_low
## 342        Caucasian     426  Medium_low
## 343 African American     204         Low
## 344        Caucasian     410  Medium_low
## 345            Asian     633 Medium-high
## 346        Caucasian       0         Low
## 347            Asian     907 Medium-high
## 348        Caucasian    1192        High
## 349        Caucasian       0         Low
## 350            Asian     503 Medium-high
## 351        Caucasian       0         Low
## 352        Caucasian     302 Medium-high
## 353 African American     583        High
## 354        Caucasian     425 Medium-high
## 355            Asian     413  Medium_low
## 356 African American    1405        High
## 357        Caucasian     962        High
## 358            Asian       0         Low
## 359        Caucasian     347  Medium_low
## 360 African American     611  Medium_low
## 361        Caucasian     712        High
## 362            Asian     382 Medium-high
## 363 African American     710 Medium-high
## 364        Caucasian     578        High
## 365            Asian    1243        High
## 366        Caucasian     790        High
## 367        Caucasian    1264        High
## 368 African American     216  Medium_low
## 369        Caucasian     345 Medium-high
## 370        Caucasian    1208        High
## 371        Caucasian     992        High
## 372        Caucasian       0         Low
## 373        Caucasian     840  Medium_low
## 374        Caucasian    1003 Medium-high
## 375        Caucasian     588 Medium-high
## 376        Caucasian    1000 Medium-high
## 377 African American     767        High
## 378        Caucasian       0         Low
## 379        Caucasian     717 Medium-high
## 380        Caucasian       0         Low
## 381 African American     661        High
## 382        Caucasian     849        High
## 383        Caucasian    1352 Medium-high
## 384        Caucasian     382  Medium_low
## 385 African American       0         Low
## 386            Asian     905 Medium-high
## 387 African American     371  Medium_low
## 388            Asian       0         Low
## 389        Caucasian    1129        High
## 390        Caucasian     806        High
## 391            Asian    1393        High
## 392        Caucasian     721        High
## 393            Asian       0         Low
## 394 African American       0         Low
## 395        Caucasian     734 Medium-high
## 396        Caucasian     560  Medium_low
## 397 African American     480  Medium_low
## 398        Caucasian     138  Medium_low
## 399        Caucasian       0         Low
## 400            Asian     966 Medium-high
Credit <- 
  Credit %>%  
  select(Balance, Limit, Income, Rating, Age)

# Fit the model
Balance_model <- lm(data = Credit, Balance ~ Limit + Income)
Balance_model %>% get_regression_table()
## # A tibble: 3 x 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept -385.       19.5       -19.8       0 -423.    -347.   
## 2 Limit        0.264     0.006      45.0       0    0.253    0.276
## 3 Income      -7.66      0.385     -19.9       0   -8.42    -6.91
reg_points <- Balance_model %>% get_regression_points()

# Plots
gf_point(data = reg_points, residual ~ Limit)

gf_point(data = reg_points, residual ~ Income)

gf_histogram(data = reg_points, ~residual,binwidth = 20, color = "white")

There appears to be a pattern to the residuals and the histogram is also right skewed. Hence there are many residuals that have a negative value and hence the model is overestimating the credit card balance by a large amount.

11.1.3 Problem LC11.4 Credit Card Balance vs Rating and Age

Balance vs Rating and Age

model_balance_2 <- lm(Balance~ Rating + Age, data = Credit)

model_balance_2 %>% get_regression_table()
## # A tibble: 3 x 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept  -270.      44.8       -6.02       0  -358.    -181.  
## 2 Rating        2.59     0.074     34.8        0     2.45     2.74
## 3 Age          -2.35     0.668     -3.52       0    -3.66    -1.04
reg_points <- model_balance_2 %>% 
  get_regression_points()
reg_points
## # A tibble: 400 x 6
##       ID Balance Rating   Age Balance_hat residual
##    <int>   <int>  <int> <int>       <dbl>    <dbl>
##  1     1     333    283    34        384.    -51.4
##  2     2     903    483    82        790.    113. 
##  3     3     580    514    71        896.   -316. 
##  4     4     964    681    36       1412.   -448. 
##  5     5     331    357    68        496.   -165. 
##  6     6    1151    569    77       1025.    126. 
##  7     7     203    259    37        315.   -112. 
##  8     8     872    512    87        854.     18.3
##  9     9     279    266    66        265.     13.9
## 10    10    1350    491    41        907.    443. 
## # … with 390 more rows
# Plots
gf_histogram(~residual, binwidths = 10, color = "white", data = reg_points)

gf_point(residual~ Rating, data = reg_points)

gf_point(residual~ Age, data = reg_points)

Histogram is near bell shaped and symmetric around zero, so model appears to be good there.

Residuals vs Age is also adequately randomly distributed.

Residuals vs Rating seems to have some patterns, where the residuals for those with higher rating is always negative, and those in the lower range have a consistently positive residual. Perhaps the model is not able to relate Rating and Balance in a linear manner.

11.1.3.1 Teacher Scores one last time

evals_ch7 <- evals %>% select(score, age, gender)
score_model_2 <- lm(score ~ age + gender, data = evals_ch7)

score_model_2 %>% get_regression_table()
## # A tibble: 3 x 7
##   term       estimate std_error statistic p_value lower_ci upper_ci
##   <chr>         <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept     4.48      0.125     35.8    0        4.24     4.73 
## 2 age          -0.009     0.003     -3.28   0.001   -0.014   -0.003
## 3 gendermale    0.191     0.052      3.63   0        0.087    0.294
reg_points <- score_model_2 %>% 
  get_regression_points()
reg_points
## # A tibble: 463 x 6
##       ID score   age gender score_hat residual
##    <int> <dbl> <int> <fct>      <dbl>    <dbl>
##  1     1   4.7    36 female      4.17    0.528
##  2     2   4.1    36 female      4.17   -0.072
##  3     3   3.9    36 female      4.17   -0.272
##  4     4   4.8    36 female      4.17    0.628
##  5     5   4.6    59 male        4.16    0.437
##  6     6   4.3    59 male        4.16    0.137
##  7     7   2.8    59 male        4.16   -1.36 
##  8     8   4.1    51 male        4.23   -0.132
##  9     9   3.4    51 male        4.23   -0.832
## 10    10   4.5    40 female      4.14    0.363
## # … with 453 more rows
# Plots
gf_histogram(~residual | gender, binwidth = 0.25, color = "white",data = reg_points)

gf_point(residual ~ age | gender, data = reg_points)

Histograms for residual are skewed left for both genders, so the model is underestimating scores.

Plot of residual vs age seems random enough.

12 Thinking with Data

12.1 Case Study 1: Seattle House Prices

The dataset consists 21,613 houses and 21 variables describing these houses; for a full list of these variables see the help file by running ?house_prices in the console. In this case study, we’ll create a model using multiple regression where: - The outcome variable y is the sale price of houses
The two explanatory/predictor variables we’ll use are : - x1 : house size sqft_living, as measured by square feet of living space, where 1 square foot is about 0.09 square meters.
- x2: house condition, a categorical variable with 5 levels where 1 indicates “poor” and 5 indicates “excellent.”

12.1.1 EDA

house_prices %>% skim()
Data summary
Name Piped data
Number of rows 21613
Number of columns 21
_______________________
Column type frequency:
character 1
Date 1
factor 3
logical 1
numeric 15
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
id 0 1 10 10 0 21436 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2014-05-02 2015-05-27 2014-10-16 372

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
condition 0 1 FALSE 5 3: 14031, 4: 5679, 5: 1701, 2: 172
grade 0 1 FALSE 12 7: 8981, 8: 6068, 9: 2615, 6: 2038
zipcode 0 1 FALSE 70 981: 602, 980: 590, 981: 583, 980: 574

Variable type: logical

skim_variable n_missing complete_rate mean count
waterfront 0 1 0.01 FAL: 21450, TRU: 163

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
price 0 1 540088.14 367127.20 75000.00 321950.00 450000.00 645000.00 7700000.00 ▇▁▁▁▁
bedrooms 0 1 3.37 0.93 0.00 3.00 3.00 4.00 33.00 ▇▁▁▁▁
bathrooms 0 1 2.11 0.77 0.00 1.75 2.25 2.50 8.00 ▃▇▁▁▁
sqft_living 0 1 2079.90 918.44 290.00 1427.00 1910.00 2550.00 13540.00 ▇▂▁▁▁
sqft_lot 0 1 15106.97 41420.51 520.00 5040.00 7618.00 10688.00 1651359.00 ▇▁▁▁▁
floors 0 1 1.49 0.54 1.00 1.00 1.50 2.00 3.50 ▇▅▁▁▁
view 0 1 0.23 0.77 0.00 0.00 0.00 0.00 4.00 ▇▁▁▁▁
sqft_above 0 1 1788.39 828.09 290.00 1190.00 1560.00 2210.00 9410.00 ▇▃▁▁▁
sqft_basement 0 1 291.51 442.58 0.00 0.00 0.00 560.00 4820.00 ▇▁▁▁▁
yr_built 0 1 1971.01 29.37 1900.00 1951.00 1975.00 1997.00 2015.00 ▂▃▇▇▇
yr_renovated 0 1 84.40 401.68 0.00 0.00 0.00 0.00 2015.00 ▇▁▁▁▁
lat 0 1 47.56 0.14 47.16 47.47 47.57 47.68 47.78 ▁▃▅▇▇
long 0 1 -122.21 0.14 -122.52 -122.33 -122.23 -122.12 -121.32 ▇▇▂▁▁
sqft_living15 0 1 1986.55 685.39 399.00 1490.00 1840.00 2360.00 6210.00 ▅▇▂▁▁
sqft_lot15 0 1 12768.46 27304.18 651.00 5100.00 7620.00 10083.00 871200.00 ▇▁▁▁▁
house_prices %>% glimpse()
## Rows: 21,613
## Columns: 21
## $ id            <chr> "7129300520", "6414100192", "5631500400", "2487200875",…
## $ date          <date> 2014-10-13, 2014-12-09, 2015-02-25, 2014-12-09, 2015-0…
## $ price         <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 257500…
## $ bedrooms      <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2…
## $ bathrooms     <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2…
## $ sqft_living   <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 18…
## $ sqft_lot      <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470…
## $ floors        <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, …
## $ waterfront    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ view          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0…
## $ condition     <fct> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4…
## $ grade         <fct> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, …
## $ sqft_above    <int> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 18…
## $ sqft_basement <int> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0,…
## $ yr_built      <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2…
## $ yr_renovated  <int> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ zipcode       <fct> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198,…
## $ lat           <dbl> 47.511, 47.721, 47.738, 47.521, 47.617, 47.656, 47.310,…
## $ long          <dbl> -122.26, -122.32, -122.23, -122.39, -122.05, -122.00, -…
## $ sqft_living15 <int> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 2…
## $ sqft_lot15    <int> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113,…
gf_bar(~condition, data = house_prices, title ="House Condition")

gf_histogram(~price, color = "white", data = house_prices, title = "House Prices")

gf_histogram(~sqft_living, data = house_prices, title = "House Size")

Most houses are <5000 sqft, valued at USD 2,000,000or less and in grades3/4/5`. Though there are outliers at upto USD 8,000,000.

house_prices %>% 
  summarise(mean_price = mean(price),
            median_price = median(price),
            sd_price = sd(price),
            IQR_price = IQR(price))
## # A tibble: 1 x 4
##   mean_price median_price sd_price IQR_price
##        <dbl>        <dbl>    <dbl>     <dbl>
## 1    540088.       450000  367127.    323050

Median is lower than mean, since this is a right skewed distribution

house_prices %>% 
  summarise(mean_sqft_living = mean(sqft_living),
            median_sqft_living = median(sqft_living),
            sd_sqft_living = sd(sqft_living),
            IQR_sqft_living = IQR(sqft_living))
## # A tibble: 1 x 4
##   mean_sqft_living median_sqft_living sd_sqft_living IQR_sqft_living
##              <dbl>              <int>          <dbl>           <dbl>
## 1            2080.               1910           918.            1123
# price and sqft vs categorical variables such as condition

house_prices <- 
  house_prices %>% 
  mutate(log10price = log10(price), log10size = log10(sqft_living))

gf_point(log10price~log10size, color = ~condition, alpha = 0.1,data = house_prices,title = "House Prices in Seatle", subtitle = "Linear Model with Interaction", caption = "Sleepless also..") %>% 
  gf_smooth(method = "lm")

# Can also look at facetted plots. Choose one of these and own it, says Chester Ismay

gf_point(log10price~log10size |condition, color = ~condition, alpha = 0.1,data = house_prices,title = "House Prices in Seatle", subtitle = "Linear Model with Interaction", caption = "Sleepless also...") %>% 
  gf_smooth(method = "lm")

Houses with condition = 5 have the steepest increase in price vs size.

12.1.2 Regression Modelling

price_interaction <- lm(log10price ~ log10size*condition, data = house_prices)
price_interaction %>% get_regression_table()
## # A tibble: 10 x 7
##    term                 estimate std_error statistic p_value lower_ci upper_ci
##    <chr>                   <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
##  1 intercept               3.33      0.451     7.38    0        2.45     4.22 
##  2 log10size               0.69      0.148     4.65    0        0.399    0.98 
##  3 condition2              0.047     0.498     0.094   0.925   -0.93     1.02 
##  4 condition3             -0.367     0.452    -0.812   0.417   -1.25     0.519
##  5 condition4             -0.398     0.453    -0.879   0.38    -1.29     0.49 
##  6 condition5             -0.883     0.457    -1.93    0.053   -1.78     0.013
##  7 log10size:condition2   -0.024     0.163    -0.148   0.882   -0.344    0.295
##  8 log10size:condition3    0.133     0.148     0.893   0.372   -0.158    0.424
##  9 log10size:condition4    0.146     0.149     0.979   0.328   -0.146    0.437
## 10 log10size:condition5    0.31      0.15      2.07    0.039    0.016    0.604

Interpreting these models we have:

\[ In\ general \\ \widehat{log10price} = \beta_0 + \beta_1*log10size + \beta_2 * condition + \\ \beta_3 * log10size * condition \\\ where\\ condition\, is\, a\, binary\, variable \] Which can be separately written for each condition as \[ \begin{cases} \widehat{log10price} & \text{= 3.33+ 0.69 * log10size }\ { if }\ Condition = 1 \\ & \text{= (3.33+0.047) + (0.69--0.024)*log10size}\ { if }\ Condition =2 \\ & \text{= (3.33--0.367) + (0.69+0.133)*log10size}\ { if }\ Condition=3 \\ & \text{= (3.33--0.398) + (0.69+0.146)*log10size}\ { if }\ Condition=4 \\ & \text{= (3.33--0.883) + (0.69 +0.310)*log10size}\ { if }\ Condition=5 \end{cases} \]

12.1.3 Making Predictions

We want to predict the price for : House: Condition 5; Area = 1900sqft

gf_point(log10price ~ log10size, color = ~ condition, alpha = 0.1,data = house_prices,title = "House Prices in Seattle", subtitle = "Linear Model with Interaction", caption = "Sleepless also..") %>% 
  gf_smooth(method = "lm") %>% 
  gf_vline(xintercept = ~ log10(1900),linetype = "dashed")

# Using the Condition 5 model we have
log10predprice <- (3.33 - 0.883) + (0.69 + 0.31)*log10(1900)
log10predprice
## [1] 5.7258
Predicted_Price = 10^(log10predprice)
Predicted_Price
## [1] 531806

12.1.4 Problem LC12.1 Housing Price vs Area - Parallel slope model

We want to predict the price for : House: Condition 5; Area = 1900sqft

price_parallel_model <- lm(log10price ~ log10size + condition, data = house_prices)
get_regression_table(price_parallel_model)
## # A tibble: 6 x 7
##   term       estimate std_error statistic p_value lower_ci upper_ci
##   <chr>         <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept     2.88      0.036     80.0    0        2.81     2.95 
## 2 log10size     0.837     0.006    134.     0        0.825    0.85 
## 3 condition2   -0.039     0.033     -1.16   0.246   -0.104    0.027
## 4 condition3    0.032     0.031      1.04   0.3     -0.028    0.092
## 5 condition4    0.044     0.031      1.42   0.155   -0.017    0.104
## 6 condition5    0.096     0.031      3.09   0.002    0.035    0.156

The parallel slope models are: \[ \widehat{log10price} = \begin{cases} & \text{ 2.882 + 0.837 * log10size }\ { if }\ Condition=1 \\ & \text{ (2.882 -- 0.039) + 0.837 * log10size }\ { if }\ Condition=2 \\ & \text{ (2.882 + 0.032) + 0.837 * log10size }\ { if }\ Condition=3 \\ & \text{ (2.882 + 0.044) + 0.837 * log10size }\ { if }\ Condition=4 \\ & \text{ (2.882 + 0.096) + 0.837 * log10size }\ { if }\ Condition=5 \end{cases} \]

# We want to predict the price for :
# House: Condition 5; Area = 1900sqft

house_prices %>% 
  gf_point(log10price ~ log10size, color = ~ condition, alpha = 0.2, title = "Seattle House Prices",subtitle = "Parallel Slope Regression Model") %>% 
  gf_abline(intercept = ~ 2.882, slope = ~0.837, color = ~ condition, data = filter(house_prices, condition == "1")) %>% 
  gf_abline(intercept = ~ (2.882-0.039), slope = ~0.837, color = ~ condition, data = filter(house_prices, condition == "2")) %>% 
  gf_abline(intercept = ~ (2.882 + 0.032), slope = ~0.837, color = ~ condition, data = filter(house_prices, condition == "3")) %>% 
  gf_abline(intercept = ~ (2.882 +0.044), slope = ~0.837, color = ~ condition, data = filter(house_prices, condition == "4")) %>%
  gf_abline(intercept = ~ (2.882 + 0.096), slope = ~0.837, color = ~ condition, data = filter(house_prices, condition == "5")) %>% 
  gf_vline(xintercept = ~ log10(1900), linetype = "dashed")

# Using the Condition 5 model we have
log10predprice <- (2.882+0.096) + (0.837)*log10(1900)
Predicted_Price = 10^(log10predprice)
Predicted_Price
## [1] 527615